Commit 3d70a15b authored by Koen van der Veen

Merge branch 'readme' into 'dev'

Update of the readme

See merge request memri/pyintegrators!10
parents 0afb1551 3a9c8dc0
Showing with 1160 additions and 496 deletions
nbs/Test.ipynb
nbs/_test.ipynb
*.bak
......
default:
image: python:3.6
before_script:
- apt-get update && apt-get install -y libsqlcipher-dev
- pip install nbdev jupyter
- apt-get update && apt-get install -y libsqlcipher-dev && apt-get install -y libgl1-mesa-glx
- pip install nbdev==1.0.4 jupyter fastcore==1.0.0
- pip install -e .
read_notebooks:
......
# Integrators
> Integrators integrate your information in the pod. They import your data from external services (Gmail, WhatsApp, iCloud, Facebook etc.), enrich your data with indexers (face recognition, spam detection, duplicate photo detection), and execute actions (sending mails, automatically sharing selected photos with your family).
> Integrators integrate your information in your Pod. They import data from external services (Gmail, WhatsApp, etc.), enrich data with indexers (face recognition, spam detection, etc.), and execute actions (sending messages, uploading files, etc.).
# Overview
We start by listing the existing indexers and their functionality. Make sure to check out their pages for usage examples.
| Integrator | Description | Tests passing |
|------------|-------------|---------------|
|`FaceRecognitionIndexer`|Recognizes faces in photos.| ![alt text](https://gitlab.memri.io/memri/pyintegrators/-/raw/face-indexer/assets/build-passing.svg "Logo Title Text 1")|
|`GeoIndexer`|Adds Countries and Cities to items with a location.| ![alt text](https://gitlab.memri.io/memri/pyintegrators/-/raw/face-indexer/assets/build-passing.svg "Logo Title Text 1")|
|`NotesListIndexer`|Extracts lists from notes and categorizes them.| ![alt text](https://gitlab.memri.io/memri/pyintegrators/-/raw/face-indexer/assets/build-passing.svg "Logo Title Text 1")|
|`FaceRecognitionIndexer`|Recognizes faces in photos.| ![Build passing](https://gitlab.memri.io/memri/pyintegrators/-/raw/prod/assets/build-passing.svg "Build passing")|
|`GeoIndexer`|Adds Countries and Cities to items with a location.| ![Build passing](https://gitlab.memri.io/memri/pyintegrators/-/raw/prod/assets/build-passing.svg "Build passing")|
|`GmailImporter`|Imports email from Gmail.| ![Build passing](https://gitlab.memri.io/memri/pyintegrators/-/raw/prod/assets/build-passing.svg "Build passing")|
|`NotesListIndexer`|Extracts lists from notes and categorizes them.| ![Build passing](https://gitlab.memri.io/memri/pyintegrators/-/raw/prod/assets/build-passing.svg "Build passing")|
Integrators for Memri have a single repo per language. This repo is the one for Python, but other repos exist for [node](https://gitlab.memri.io/memri/nodeintegrators) and we are planning to create one for Rust. This repo is built with [nbdev](https://github.com/fastai/nbdev), so all code, documentation and tests are written in one place as Jupyter notebooks and exported to a Python package, a Jekyll website and unit tests.
## Install
Integrators for Memri have a single repository per language. This repository is the one for Python; others exist for [Node.js](https://gitlab.memri.io/memri/nodeintegrators) and [Rust](https://gitlab.memri.io/memri/rustintegrators). This repository makes use of [nbdev](https://github.com/fastai/nbdev), which means all code, documentation and tests are written in Jupyter Notebooks and exported to a Python package, a Jekyll documentation site and unit tests.
## Using Docker
Integrators are invoked by the Pod by launching a Docker container. To build the images for these containers, run:
```bash
docker build -t memri-pyintegrators .
```
pip install -e integrators
nbdev_install_git_hooks
```
This last command clears your notebooks of unnecessary metadata when making a commit.
## Build
To enable calling integrators from the [pod](https://gitlab.memri.io/memri/pod), the integrator Docker containers need to be built. *You can skip this if you are developing an indexer locally and you don't want to integrate with the pod yet.* To build, run:
```
./examples/build.sh
## Local build
### Install
To install the Python package:
```bash
pip install -e .
```
Now, the pod is able to find the integrator container when calling it.
## How to develop with nbdev
The python integrators are written in nbdev. With nbdev, you use jupyter notebooks as a single source of truth, and generate the library, documentation and tests from the notebooks. The [nbdev website](https://github.com/fastai/nbdev) contains great documentation that will help you understand how to develop with it. If you don't want to read that, the most important things to get you started are:
If you want to contribute, you have to clean the Jupyter Notebooks every time before you push code to prevent conflicts
in the Notebooks' metadata. A script to do so can be installed using:
```bash
nbdev_install_git_hooks
```
- Add `#export` flags to the cells that define the functions you want to include in your python modules.
- Add `#default_exp <packagename>.<modulename>` to the top of your notebook to define the python module to export to.
- All cells that are not exported are tests by default
### Jupyter Notebooks
The Python integrators are written in nbdev. With nbdev, you write all code in
[Jupyter Notebooks](https://jupyter.readthedocs.io/en/latest/install/notebook-classic.html), and generate the library, documentation and tests using the nbdev CLI.
When you are done writing your code in notebooks, call `nbdev_build_lib` to convert the notebooks to code and tests. Call `nbdev_build_docs` to generate the docs.
### nbdev
With nbdev we create the code in Notebooks, where we specify the use of cells using special tags. See the [nbdev documentation](https://nbdev.fast.ai/) for all functionalities and tutorials. The most important tags are listed below.
### Run tests
#### nbdev tags
- Notebooks whose names start with an underscore are ignored by nbdev completely
- Add `#default_exp <packagename>.<modulename>` to the top of your notebook to define the Python module to export to
- Add `#export` to the cells that define functions to include in the Python modules
- All cells without the `#export` tag are tests by default
- All cells are included in the documentation, unless you add the keyword `#hide`
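
As a rough illustration of the tags above, the cells of a small notebook might look like the sketch below. The module name `importers.example` and the function `normalize_address` are hypothetical and only serve the example. The `# Cell N` comments mark cell boundaries for the reader; in a real notebook each tag would sit on the first line of its own cell.

```python
# Cell 1: choose the module that exported cells are written to.
#default_exp importers.example

# Cell 2: tagged with #export, so it ends up in the Python module.
#export
def normalize_address(address):
    "Lower-case an email address and strip surrounding whitespace."
    return address.strip().lower()

# Cell 3: no #export tag, so nbdev treats this cell as a test.
assert normalize_address("  A@Gmail.com ") == "a@gmail.com"

# Cell 4: tagged with #hide, so it is left out of the documentation.
#hide
print("scratch output that should not appear in the docs")
```

Keeping the assertion next to the function it checks is what lets the same notebook act as library source, documentation and test.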
Every cell without the `#export` flag will be a test, so make sure that the code in your notebooks runs fast and without errors. You can run all tests by calling:
#### nbdev CLI
After developing your code in Notebooks, you can use the nbdev CLI:
- `nbdev_build_lib` to convert the Notebooks to the library and tests
- `nbdev_test_nbs` to run the tests
- `nbdev_build_docs` to generate the docs
- `nbdev_clean_nbs` to clean the Notebooks' metadata to prevent Git conflicts
```
nbdev_test_nbs
```
### Contributing
Before you make a merge request, make sure that you used all the nbdev commands specified above, or GitLab's CI won't pass.
## Docs
Find the online docs at [pyintegrators.docs.memri.io](https://pyintegrators.docs.memri.io/).
If you want to hide certain functionality in the docs, you can use the `#hide` flag at the top of a cell.
### Render documentation locally
New documentation will be deployed automatically when a new version is released to the `prod` branch. To inspect the documentation beforehand, you can run it on your local machine after [installing Jekyll](https://jekyllrb.com/docs/installation/).
### Render docs locally
Often you might want to check your docs locally before deploying them. To do so, first install Jekyll:
```
gem install bundler jekyll
To build the documentation:
```bash
cd docs
gem update --system
bundle install
```
Then, run the Jekyll server:
```
cd docs
To serve the documentation:
```bash
bundle exec jekyll serve
```
And that's it!
......@@ -20,7 +20,10 @@ entries:
- folderitems:
- output: web,pdf
title: Overview
url: indexers.indexer.html
url: importers.Importer.html
- output: web,pdf
title: GmailImporter
url: importers.GmailImporter.html
output: web
title: Importers
- folderitems:
......
---
title: Title
keywords: fastai
sidebar: home_sidebar
nb_path: "nbs/gmail.ipynb"
---
<!--
#################################################
### THIS FILE WAS AUTOGENERATED! DO NOT EDIT! ###
#################################################
# file to edit: nbs/gmail.ipynb
# command to build the docs after a change: nbdev_build_docs
-->
<div class="container" id="notebook-container">
{% raw %}
<div class="cell border-box-sizing code_cell rendered">
</div>
{% endraw %}
{% raw %}
<div class="cell border-box-sizing code_cell rendered">
</div>
{% endraw %}
{% raw %}
<div class="cell border-box-sizing code_cell rendered">
</div>
{% endraw %}
{% raw %}
<div class="cell border-box-sizing code_cell rendered">
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_markdown rendered_html output_subarea ">
<h2 id="IMAPClient" class="doc_header"><code>class</code> <code>IMAPClient</code><a href="https://gitlab.memri.io/memri/integrators/tree/master/integrators/importers/gmail.py#L13" class="source_link" style="float:right">[source]</a></h2><blockquote><p><code>IMAPClient</code>(<strong><code>username</code></strong>, <strong><code>app_pw</code></strong>, <strong><code>host</code></strong>=<em><code>'imap.gmail.com'</code></em>, <strong><code>port</code></strong>=<em><code>993</code></em>, <strong><code>inbox</code></strong>=<em><code>'"[Gmail]/All Mail"'</code></em>)</p>
</blockquote>
</div>
</div>
</div>
</div>
</div>
{% endraw %}
{% raw %}
<div class="cell border-box-sizing code_cell rendered">
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_markdown rendered_html output_subarea ">
<h4 id="get_message_content" class="doc_header"><code>get_message_content</code><a href="https://gitlab.memri.io/memri/integrators/tree/master/integrators/importers/gmail.py#L92" class="source_link" style="float:right">[source]</a></h4><blockquote><p><code>get_message_content</code>(<strong><code>message</code></strong>)</p>
</blockquote>
</div>
</div>
</div>
</div>
</div>
{% endraw %}
{% raw %}
<div class="cell border-box-sizing code_cell rendered">
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_markdown rendered_html output_subarea ">
<h4 id="get_addresses_from_message" class="doc_header"><code>get_addresses_from_message</code><a href="https://gitlab.memri.io/memri/integrators/tree/master/integrators/importers/gmail.py#L108" class="source_link" style="float:right">[source]</a></h4><blockquote><p><code>get_addresses_from_message</code>(<strong><code>message</code></strong>, <strong><code>field</code></strong>)</p>
</blockquote>
</div>
</div>
</div>
</div>
</div>
{% endraw %}
{% raw %}
<div class="cell border-box-sizing code_cell rendered">
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_markdown rendered_html output_subarea ">
<h4 id="get_timestamp_from_message" class="doc_header"><code>get_timestamp_from_message</code><a href="https://gitlab.memri.io/memri/integrators/tree/master/integrators/importers/gmail.py#L114" class="source_link" style="float:right">[source]</a></h4><blockquote><p><code>get_timestamp_from_message</code>(<strong><code>message</code></strong>)</p>
</blockquote>
</div>
</div>
</div>
</div>
</div>
{% endraw %}
{% raw %}
<div class="cell border-box-sizing code_cell rendered">
</div>
{% endraw %}
{% raw %}
<div class="cell border-box-sizing code_cell rendered">
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_markdown rendered_html output_subarea ">
<h4 id="create_item_from_mail" class="doc_header"><code>create_item_from_mail</code><a href="https://gitlab.memri.io/memri/integrators/tree/master/integrators/importers/gmail.py#L124" class="source_link" style="float:right">[source]</a></h4><blockquote><p><code>create_item_from_mail</code>(<strong><code>mail</code></strong>, <strong><code>thread_id</code></strong>=<em><code>None</code></em>)</p>
</blockquote>
</div>
</div>
</div>
</div>
</div>
{% endraw %}
{% raw %}
<div class="cell border-box-sizing code_cell rendered">
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_markdown rendered_html output_subarea ">
<h4 id="download_mails" class="doc_header"><code>download_mails</code><a href="https://gitlab.memri.io/memri/integrators/tree/master/integrators/importers/gmail.py#L166" class="source_link" style="float:right">[source]</a></h4><blockquote><p><code>download_mails</code>(<strong><code>imap_client</code></strong>, <strong><code>gmail_ids</code></strong>, <strong><code>stop_at</code></strong>)</p>
</blockquote>
</div>
</div>
</div>
</div>
</div>
{% endraw %}
{% raw %}
<div class="cell border-box-sizing code_cell rendered">
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_markdown rendered_html output_subarea ">
<h4 id="merge_duplicate_items" class="doc_header"><code>merge_duplicate_items</code><a href="https://gitlab.memri.io/memri/integrators/tree/master/integrators/importers/gmail.py#L184" class="source_link" style="float:right">[source]</a></h4><blockquote><p><code>merge_duplicate_items</code>(<strong><code>all_mails</code></strong>)</p>
</blockquote>
</div>
</div>
</div>
</div>
</div>
{% endraw %}
{% raw %}
<div class="cell border-box-sizing code_cell rendered">
</div>
{% endraw %}
{% raw %}
<div class="cell border-box-sizing code_cell rendered">
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_markdown rendered_html output_subarea ">
<h2 id="GmailImporter" class="doc_header"><code>class</code> <code>GmailImporter</code><a href="" class="source_link" style="float:right">[source]</a></h2><blockquote><p><code>GmailImporter</code>(<strong>*<code>args</code></strong>, <strong>**<code>kwargs</code></strong>) :: <a href="/integrators/indexers.indexer.html#ImporterBase"><code>ImporterBase</code></a></p>
</blockquote>
<p>Imports email from GMail.</p>
</div>
</div>
</div>
</div>
</div>
{% endraw %}
{% raw %}
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="c1"># Store your credentials in this file:</span>
<span class="n">file</span> <span class="o">=</span> <span class="nb">open</span><span class="p">(</span><span class="s1">&#39;tmp/credentials_gmail.txt&#39;</span><span class="p">,</span><span class="s1">&#39;r&#39;</span><span class="p">)</span>
<span class="n">imap_host</span> <span class="o">=</span> <span class="s1">&#39;imap.gmail.com&#39;</span>
<span class="n">imap_user</span> <span class="o">=</span> <span class="n">file</span><span class="o">.</span><span class="n">readline</span><span class="p">()</span><span class="o">.</span><span class="n">strip</span><span class="p">(</span><span class="s1">&#39;</span><span class="se">\n</span><span class="s1">&#39;</span><span class="p">)</span>
<span class="n">imap_pw</span> <span class="o">=</span> <span class="n">file</span><span class="o">.</span><span class="n">readline</span><span class="p">()</span><span class="o">.</span><span class="n">strip</span><span class="p">(</span><span class="s1">&#39;</span><span class="se">\n</span><span class="s1">&#39;</span><span class="p">)</span>
<span class="n">pod_client</span> <span class="o">=</span> <span class="n">PodClient</span><span class="p">()</span>
<span class="n">pod_client</span><span class="o">.</span><span class="n">delete_all</span><span class="p">()</span>
<span class="n">importer_run</span> <span class="o">=</span> <span class="n">ImporterRun</span><span class="o">.</span><span class="n">from_data</span><span class="p">(</span><span class="n">progress</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">username</span><span class="o">=</span><span class="n">imap_user</span><span class="p">,</span> <span class="n">password</span><span class="o">=</span><span class="n">imap_pw</span><span class="p">)</span>
<span class="n">importer</span> <span class="o">=</span> <span class="n">GmailImporter</span><span class="o">.</span><span class="n">from_data</span><span class="p">()</span>
<span class="n">importer</span><span class="o">.</span><span class="n">run</span><span class="p">(</span><span class="n">importer_run</span><span class="o">=</span><span class="n">importer_run</span><span class="p">,</span> <span class="n">pod_client</span><span class="o">=</span><span class="n">pod_client</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_subarea output_stream output_stderr output_text">
<pre>UsageError: Line magic function `%nbdev_slow_test` not found.
</pre>
</div>
</div>
</div>
</div>
</div>
{% endraw %}
{% raw %}
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">test</span> <span class="o">=</span> <span class="sa">b</span><span class="s2">&quot;&quot;&quot;</span><span class="se">\</span>
<span class="s2">Message-id: 1234</span><span class="se">\r</span><span class="s2"></span>
<span class="s2">From: user1 &lt;a@gmail.com&gt;</span><span class="se">\r</span><span class="s2"></span>
<span class="s2">To: user1 &lt;b@gmail.com&gt;</span><span class="se">\r</span><span class="s2"></span>
<span class="s2">Reply-to: user1 &lt;c@gmail.com&gt;</span><span class="se">\r</span><span class="s2"></span>
<span class="s2">Subject: the subject</span><span class="se">\r</span><span class="s2"></span>
<span class="s2">Date: Mon, 04 May 2020 00:37:44 -0700</span><span class="se">\r</span><span class="s2"></span>
<span class="s2">This is content&quot;&quot;&quot;</span>
<span class="c1">#mail_message = email.message_from_string(test)</span>
<span class="n">mail_item</span> <span class="o">=</span> <span class="n">create_item_from_mail</span><span class="p">(</span><span class="n">test</span><span class="p">,</span> <span class="s1">&#39;message_channel_id&#39;</span><span class="p">)</span>
<span class="k">assert</span> <span class="n">mail_item</span><span class="o">.</span><span class="n">externalId</span> <span class="o">==</span> <span class="s1">&#39;1234&#39;</span>
<span class="k">assert</span> <span class="n">mail_item</span><span class="o">.</span><span class="n">sender</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">externalId</span> <span class="o">==</span> <span class="s1">&#39;a@gmail.com&#39;</span>
<span class="k">assert</span> <span class="n">mail_item</span><span class="o">.</span><span class="n">receiver</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">externalId</span> <span class="o">==</span> <span class="s1">&#39;b@gmail.com&#39;</span>
<span class="k">assert</span> <span class="n">mail_item</span><span class="o">.</span><span class="n">replyTo</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">externalId</span> <span class="o">==</span> <span class="s1">&#39;c@gmail.com&#39;</span>
<span class="k">assert</span> <span class="n">mail_item</span><span class="o">.</span><span class="n">subject</span> <span class="o">==</span> <span class="s1">&#39;the subject&#39;</span>
<span class="k">assert</span> <span class="n">mail_item</span><span class="o">.</span><span class="n">content</span> <span class="o">==</span> <span class="s1">&#39;This is content&#39;</span>
<span class="k">assert</span> <span class="n">mail_item</span><span class="o">.</span><span class="n">dateSent</span> <span class="o">==</span> <span class="n">get_timestamp_from_message</span><span class="p">(</span><span class="n">email</span><span class="o">.</span><span class="n">message_from_bytes</span><span class="p">(</span><span class="n">test</span><span class="p">))</span>
<span class="k">assert</span> <span class="n">mail_item</span><span class="o">.</span><span class="n">messageChannel</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">externalId</span> <span class="o">==</span> <span class="s1">&#39;message_channel_id&#39;</span>
</pre></div>
</div>
</div>
</div>
</div>
{% endraw %}
{% raw %}
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">message</span> <span class="o">=</span> <span class="n">email</span><span class="o">.</span><span class="n">message</span><span class="o">.</span><span class="n">EmailMessage</span><span class="p">()</span>
<span class="n">message</span><span class="o">.</span><span class="n">set_content</span><span class="p">(</span><span class="s1">&#39;aa&#39;</span><span class="p">)</span>
<span class="n">message</span><span class="o">.</span><span class="n">add_attachment</span><span class="p">(</span><span class="sa">b</span><span class="s1">&#39;bb&#39;</span><span class="p">,</span> <span class="n">maintype</span><span class="o">=</span><span class="s1">&#39;image&#39;</span><span class="p">,</span> <span class="n">subtype</span><span class="o">=</span><span class="s1">&#39;jpeg&#39;</span><span class="p">,</span> <span class="n">filename</span><span class="o">=</span><span class="s1">&#39;sample.jpg&#39;</span><span class="p">)</span>
<span class="n">message</span><span class="o">.</span><span class="n">add_attachment</span><span class="p">(</span><span class="sa">b</span><span class="s1">&#39;cc&#39;</span><span class="p">,</span> <span class="n">maintype</span><span class="o">=</span><span class="s1">&#39;image&#39;</span><span class="p">,</span> <span class="n">subtype</span><span class="o">=</span><span class="s1">&#39;jpeg&#39;</span><span class="p">,</span> <span class="n">filename</span><span class="o">=</span><span class="s1">&#39;sample2.jpg&#39;</span><span class="p">)</span>
<span class="n">content</span><span class="p">,</span> <span class="n">attachments</span> <span class="o">=</span> <span class="n">get_message_content</span><span class="p">(</span><span class="n">message</span><span class="p">)</span>
<span class="k">assert</span> <span class="n">content</span> <span class="o">==</span> <span class="s1">&#39;aa</span><span class="se">\n</span><span class="s1">&#39;</span>
<span class="k">assert</span> <span class="n">attachments</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">get_content</span><span class="p">()</span> <span class="o">==</span> <span class="sa">b</span><span class="s1">&#39;bb&#39;</span>
<span class="k">assert</span> <span class="n">attachments</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span><span class="o">.</span><span class="n">get_content</span><span class="p">()</span> <span class="o">==</span> <span class="sa">b</span><span class="s1">&#39;cc&#39;</span>
</pre></div>
</div>
</div>
</div>
</div>
{% endraw %}
</div>
---
title: Importer
keywords: fastai
sidebar: home_sidebar
nb_path: "nbs/importer.Importer.ipynb"
---
<!--
#################################################
### THIS FILE WAS AUTOGENERATED! DO NOT EDIT! ###
#################################################
# file to edit: nbs/importer.Importer.ipynb
# command to build the docs after a change: nbdev_build_docs
-->
<div class="container" id="notebook-container">
{% raw %}
<div class="cell border-box-sizing code_cell rendered">
</div>
{% endraw %}
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Importers download your data from other sources. For retrieving the data from other sources importers rely on downloaders.</p>
</div>
</div>
</div>
{% raw %}
<div class="cell border-box-sizing code_cell rendered">
</div>
{% endraw %}
{% raw %}
<div class="cell border-box-sizing code_cell rendered">
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_markdown rendered_html output_subarea ">
<h2 id="ImporterBase" class="doc_header"><code>class</code> <code>ImporterBase</code><a href="" class="source_link" style="float:right">[source]</a></h2><blockquote><p><code>ImporterBase</code>(<strong><code>importerClass</code></strong>=<em><code>None</code></em>, <strong>*<code>args</code></strong>, <strong>**<code>kwargs</code></strong>) :: <code>Importer</code></p>
</blockquote>
<p>Provides a base class for all items. All items in the schema inherit from this class, and it provides some
basic functionality for consistency and to enable easier usage.</p>
</div>
</div>
</div>
</div>
</div>
{% endraw %}
</div>
---
title: Gmail Importer
keywords: fastai
sidebar: home_sidebar
nb_path: "nbs/importers.GmailImporter.ipynb"
---
<!--
#################################################
### THIS FILE WAS AUTOGENERATED! DO NOT EDIT! ###
#################################################
# file to edit: nbs/importers.GmailImporter.ipynb
# command to build the docs after a change: nbdev_build_docs
-->
<div class="container" id="notebook-container">
{% raw %}
<div class="cell border-box-sizing code_cell rendered">
</div>
{% endraw %}
{% raw %}
<div class="cell border-box-sizing code_cell rendered">
</div>
{% endraw %}
{% raw %}
<div class="cell border-box-sizing code_cell rendered">
</div>
{% endraw %}
{% raw %}
<div class="cell border-box-sizing code_cell rendered">
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_markdown rendered_html output_subarea ">
<h2 id="IMAPClient" class="doc_header"><code>class</code> <code>IMAPClient</code><a href="https://gitlab.memri.io/memri/integrators/tree/master/integrators/importers/gmail.py#L13" class="source_link" style="float:right">[source]</a></h2><blockquote><p><code>IMAPClient</code>(<strong><code>username</code></strong>, <strong><code>app_pw</code></strong>, <strong><code>host</code></strong>=<em><code>'imap.gmail.com'</code></em>, <strong><code>port</code></strong>=<em><code>993</code></em>, <strong><code>inbox</code></strong>=<em><code>'"[Gmail]/All Mail"'</code></em>)</p>
</blockquote>
</div>
</div>
</div>
</div>
</div>
{% endraw %}
{% raw %}
<div class="cell border-box-sizing code_cell rendered">
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_markdown rendered_html output_subarea ">
<h4 id="get_message_content" class="doc_header"><code>get_message_content</code><a href="https://gitlab.memri.io/memri/integrators/tree/master/integrators/importers/gmail.py#L92" class="source_link" style="float:right">[source]</a></h4><blockquote><p><code>get_message_content</code>(<strong><code>message</code></strong>)</p>
</blockquote>
</div>
</div>
</div>
</div>
</div>
{% endraw %}
{% raw %}
<div class="cell border-box-sizing code_cell rendered">
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_markdown rendered_html output_subarea ">
<h4 id="get_addresses_from_message" class="doc_header"><code>get_addresses_from_message</code><a href="https://gitlab.memri.io/memri/integrators/tree/master/integrators/importers/gmail.py#L108" class="source_link" style="float:right">[source]</a></h4><blockquote><p><code>get_addresses_from_message</code>(<strong><code>message</code></strong>, <strong><code>field</code></strong>)</p>
</blockquote>
</div>
</div>
</div>
</div>
</div>
{% endraw %}
{% raw %}
<div class="cell border-box-sizing code_cell rendered">
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_markdown rendered_html output_subarea ">
<h4 id="get_timestamp_from_message" class="doc_header"><code>get_timestamp_from_message</code><a href="https://gitlab.memri.io/memri/integrators/tree/master/integrators/importers/gmail.py#L114" class="source_link" style="float:right">[source]</a></h4><blockquote><p><code>get_timestamp_from_message</code>(<strong><code>message</code></strong>)</p>
</blockquote>
</div>
</div>
</div>
</div>
</div>
{% endraw %}
{% raw %}
<div class="cell border-box-sizing code_cell rendered">
</div>
{% endraw %}
{% raw %}
<div class="cell border-box-sizing code_cell rendered">
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_markdown rendered_html output_subarea ">
<h4 id="create_item_from_mail" class="doc_header"><code>create_item_from_mail</code><a href="https://gitlab.memri.io/memri/integrators/tree/master/integrators/importers/gmail.py#L124" class="source_link" style="float:right">[source]</a></h4><blockquote><p><code>create_item_from_mail</code>(<strong><code>mail</code></strong>, <strong><code>thread_id</code></strong>=<em><code>None</code></em>)</p>
</blockquote>
</div>
</div>
</div>
</div>
</div>
{% endraw %}
{% raw %}
<div class="cell border-box-sizing code_cell rendered">
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_markdown rendered_html output_subarea ">
<h4 id="download_mails" class="doc_header"><code>download_mails</code><a href="https://gitlab.memri.io/memri/integrators/tree/master/integrators/importers/gmail.py#L166" class="source_link" style="float:right">[source]</a></h4><blockquote><p><code>download_mails</code>(<strong><code>imap_client</code></strong>, <strong><code>gmail_ids</code></strong>, <strong><code>stop_at</code></strong>)</p>
</blockquote>
</div>
</div>
</div>
</div>
</div>
{% endraw %}
{% raw %}
<div class="cell border-box-sizing code_cell rendered">
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_markdown rendered_html output_subarea ">
<h4 id="merge_duplicate_items" class="doc_header"><code>merge_duplicate_items</code><a href="https://gitlab.memri.io/memri/integrators/tree/master/integrators/importers/gmail.py#L184" class="source_link" style="float:right">[source]</a></h4><blockquote><p><code>merge_duplicate_items</code>(<strong><code>all_mails</code></strong>)</p>
</blockquote>
</div>
</div>
</div>
</div>
</div>
{% endraw %}
{% raw %}
<div class="cell border-box-sizing code_cell rendered">
</div>
{% endraw %}
{% raw %}
<div class="cell border-box-sizing code_cell rendered">
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_markdown rendered_html output_subarea ">
<h2 id="GmailImporter" class="doc_header"><code>class</code> <code>GmailImporter</code><a href="" class="source_link" style="float:right">[source]</a></h2><blockquote><p><code>GmailImporter</code>(<strong>*<code>args</code></strong>, <strong>**<code>kwargs</code></strong>) :: <a href="/integrators/importers.Importer.html#ImporterBase"><code>ImporterBase</code></a></p>
</blockquote>
<p>Imports email from GMail.</p>
</div>
</div>
</div>
</div>
</div>
{% endraw %}
{% raw %}
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="c1"># Store your credentials in this file:</span>
<span class="n">file</span> <span class="o">=</span> <span class="nb">open</span><span class="p">(</span><span class="s1">&#39;tmp/credentials_gmail.txt&#39;</span><span class="p">,</span><span class="s1">&#39;r&#39;</span><span class="p">)</span>
<span class="n">imap_host</span> <span class="o">=</span> <span class="s1">&#39;imap.gmail.com&#39;</span>
<span class="n">imap_user</span> <span class="o">=</span> <span class="n">file</span><span class="o">.</span><span class="n">readline</span><span class="p">()</span><span class="o">.</span><span class="n">strip</span><span class="p">(</span><span class="s1">&#39;</span><span class="se">\n</span><span class="s1">&#39;</span><span class="p">)</span>
<span class="n">imap_pw</span> <span class="o">=</span> <span class="n">file</span><span class="o">.</span><span class="n">readline</span><span class="p">()</span><span class="o">.</span><span class="n">strip</span><span class="p">(</span><span class="s1">&#39;</span><span class="se">\n</span><span class="s1">&#39;</span><span class="p">)</span>
<span class="n">pod_client</span> <span class="o">=</span> <span class="n">PodClient</span><span class="p">()</span>
<span class="n">pod_client</span><span class="o">.</span><span class="n">delete_all</span><span class="p">()</span>
<span class="n">importer_run</span> <span class="o">=</span> <span class="n">ImporterRun</span><span class="o">.</span><span class="n">from_data</span><span class="p">(</span><span class="n">progress</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">username</span><span class="o">=</span><span class="n">imap_user</span><span class="p">,</span> <span class="n">password</span><span class="o">=</span><span class="n">imap_pw</span><span class="p">)</span>
<span class="n">importer</span> <span class="o">=</span> <span class="n">GmailImporter</span><span class="o">.</span><span class="n">from_data</span><span class="p">()</span>
<span class="n">importer</span><span class="o">.</span><span class="n">run</span><span class="p">(</span><span class="n">importer_run</span><span class="o">=</span><span class="n">importer_run</span><span class="p">,</span> <span class="n">pod_client</span><span class="o">=</span><span class="n">pod_client</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_subarea output_stream output_stderr output_text">
<pre>UsageError: Line magic function `%nbdev_slow_test` not found.
</pre>
</div>
</div>
</div>
</div>
</div>
{% endraw %}
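<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The cell above reads the username and password from a two-line text file. A minimal sketch of the same idea with a context manager, so the file is always closed, could look like this. The helper name <code>read_credentials</code> and the demo values are assumptions for illustration and are not part of the package.</p>

```python
import os
import tempfile


def read_credentials(path):
    """Return (username, password) from a two-line credentials file."""
    # The context manager closes the file even if reading fails.
    with open(path, "r") as handle:
        lines = handle.read().splitlines()
    if len(lines[:2]) != 2:
        raise ValueError("expected a username line and a password line")
    return lines[0].strip(), lines[1].strip()


# Demonstration with a throwaway file instead of real credentials.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "credentials_gmail.txt")
    with open(demo_path, "w") as handle:
        handle.write("someone@example.com\nnot-a-real-password\n")
    imap_user, imap_pw = read_credentials(demo_path)
```

<p>The two returned values could then be passed on as <code>username</code> and <code>password</code>, as in the cell above.</p>
</div>
</div>
</div>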
{% raw %}
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">test</span> <span class="o">=</span> <span class="sa">b</span><span class="s2">&quot;&quot;&quot;</span><span class="se">\</span>
<span class="s2">Message-id: 1234</span><span class="se">\r</span><span class="s2"></span>
<span class="s2">From: user1 &lt;a@gmail.com&gt;</span><span class="se">\r</span><span class="s2"></span>
<span class="s2">To: user1 &lt;b@gmail.com&gt;</span><span class="se">\r</span><span class="s2"></span>
<span class="s2">Reply-to: user1 &lt;c@gmail.com&gt;</span><span class="se">\r</span><span class="s2"></span>
<span class="s2">Subject: the subject</span><span class="se">\r</span><span class="s2"></span>
<span class="s2">Date: Mon, 04 May 2020 00:37:44 -0700</span><span class="se">\r</span><span class="s2"></span>
<span class="s2">This is content&quot;&quot;&quot;</span>
<span class="c1">#mail_message = email.message_from_string(test)</span>
<span class="n">mail_item</span> <span class="o">=</span> <span class="n">create_item_from_mail</span><span class="p">(</span><span class="n">test</span><span class="p">,</span> <span class="s1">&#39;message_channel_id&#39;</span><span class="p">)</span>
<span class="k">assert</span> <span class="n">mail_item</span><span class="o">.</span><span class="n">externalId</span> <span class="o">==</span> <span class="s1">&#39;1234&#39;</span>
<span class="k">assert</span> <span class="n">mail_item</span><span class="o">.</span><span class="n">sender</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">externalId</span> <span class="o">==</span> <span class="s1">&#39;a@gmail.com&#39;</span>
<span class="k">assert</span> <span class="n">mail_item</span><span class="o">.</span><span class="n">receiver</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">externalId</span> <span class="o">==</span> <span class="s1">&#39;b@gmail.com&#39;</span>
<span class="k">assert</span> <span class="n">mail_item</span><span class="o">.</span><span class="n">replyTo</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">externalId</span> <span class="o">==</span> <span class="s1">&#39;c@gmail.com&#39;</span>
<span class="k">assert</span> <span class="n">mail_item</span><span class="o">.</span><span class="n">subject</span> <span class="o">==</span> <span class="s1">&#39;the subject&#39;</span>
<span class="k">assert</span> <span class="n">mail_item</span><span class="o">.</span><span class="n">content</span> <span class="o">==</span> <span class="s1">&#39;This is content&#39;</span>
<span class="k">assert</span> <span class="n">mail_item</span><span class="o">.</span><span class="n">dateSent</span> <span class="o">==</span> <span class="n">get_timestamp_from_message</span><span class="p">(</span><span class="n">email</span><span class="o">.</span><span class="n">message_from_bytes</span><span class="p">(</span><span class="n">test</span><span class="p">))</span>
<span class="k">assert</span> <span class="n">mail_item</span><span class="o">.</span><span class="n">messageChannel</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">externalId</span> <span class="o">==</span> <span class="s1">&#39;message_channel_id&#39;</span>
</pre></div>
</div>
</div>
</div>
</div>
{% endraw %}
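<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The test above checks which header ends up in which field of the mail item. As a rough illustration of that mapping, the sketch below extracts the same fields with only the Python standard library. It is not the implementation of <code>create_item_from_mail</code>; the function <code>parse_mail</code> and the dictionary keys are hypothetical.</p>

```python
import email
from email.utils import formataddr, parseaddr, parsedate_to_datetime


def parse_mail(raw_bytes):
    """Extract a few fields from a raw RFC 5322 message (illustrative only)."""
    message = email.message_from_bytes(raw_bytes)
    sent = parsedate_to_datetime(message["Date"])
    return {
        "external_id": message["Message-id"],
        # parseaddr splits a header into (display name, address).
        "sender": parseaddr(message["From"])[1],
        "receiver": parseaddr(message["To"])[1],
        "subject": message["Subject"],
        "timestamp": int(sent.timestamp()),
        "content": message.get_payload(),
    }


# Headers and body are separated by an empty line.
raw = "\r\n".join([
    "Message-id: 1234",
    "From: " + formataddr(("user1", "a@gmail.com")),
    "To: " + formataddr(("user2", "b@gmail.com")),
    "Subject: the subject",
    "Date: Mon, 04 May 2020 00:37:44 -0700",
    "",
    "This is content",
]).encode("ascii")

parsed = parse_mail(raw)
```

<p>Here the sender address is taken from the <code>From</code> header and the timestamp is derived from the <code>Date</code> header, including its UTC offset.</p>
</div>
</div>
</div>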
{% raw %}
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">message</span> <span class="o">=</span> <span class="n">email</span><span class="o">.</span><span class="n">message</span><span class="o">.</span><span class="n">EmailMessage</span><span class="p">()</span>
<span class="n">message</span><span class="o">.</span><span class="n">set_content</span><span class="p">(</span><span class="s1">&#39;aa&#39;</span><span class="p">)</span>
<span class="n">message</span><span class="o">.</span><span class="n">add_attachment</span><span class="p">(</span><span class="sa">b</span><span class="s1">&#39;bb&#39;</span><span class="p">,</span> <span class="n">maintype</span><span class="o">=</span><span class="s1">&#39;image&#39;</span><span class="p">,</span> <span class="n">subtype</span><span class="o">=</span><span class="s1">&#39;jpeg&#39;</span><span class="p">,</span> <span class="n">filename</span><span class="o">=</span><span class="s1">&#39;sample.jpg&#39;</span><span class="p">)</span>
<span class="n">message</span><span class="o">.</span><span class="n">add_attachment</span><span class="p">(</span><span class="sa">b</span><span class="s1">&#39;cc&#39;</span><span class="p">,</span> <span class="n">maintype</span><span class="o">=</span><span class="s1">&#39;image&#39;</span><span class="p">,</span> <span class="n">subtype</span><span class="o">=</span><span class="s1">&#39;jpeg&#39;</span><span class="p">,</span> <span class="n">filename</span><span class="o">=</span><span class="s1">&#39;sample2.jpg&#39;</span><span class="p">)</span>
<span class="n">content</span><span class="p">,</span> <span class="n">attachments</span> <span class="o">=</span> <span class="n">get_message_content</span><span class="p">(</span><span class="n">message</span><span class="p">)</span>
<span class="k">assert</span> <span class="n">content</span> <span class="o">==</span> <span class="s1">&#39;aa</span><span class="se">\n</span><span class="s1">&#39;</span>
<span class="k">assert</span> <span class="n">attachments</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">get_content</span><span class="p">()</span> <span class="o">==</span> <span class="sa">b</span><span class="s1">&#39;bb&#39;</span>
<span class="k">assert</span> <span class="n">attachments</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span><span class="o">.</span><span class="n">get_content</span><span class="p">()</span> <span class="o">==</span> <span class="sa">b</span><span class="s1">&#39;cc&#39;</span>
</pre></div>
</div>
</div>
</div>
</div>
{% endraw %}
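<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The test above expects the text body and the attachments to be returned separately. One way to make such a split with the standard library is sketched below. This is an assumption about the approach, not the code of <code>get_message_content</code>; the name <code>split_content</code> is hypothetical.</p>

```python
from email.message import EmailMessage


def split_content(message):
    """Return (plain-text body, list of attachment parts)."""
    body = message.get_body(preferencelist=("plain",))
    text = body.get_content() if body is not None else ""
    # iter_attachments skips the body part and yields the remaining parts.
    attachments = list(message.iter_attachments())
    return text, attachments


message = EmailMessage()
message.set_content("aa")
message.add_attachment(b"bb", maintype="image", subtype="jpeg", filename="sample.jpg")
message.add_attachment(b"cc", maintype="image", subtype="jpeg", filename="sample2.jpg")

text, attachments = split_content(message)
filenames = [part.get_filename() for part in attachments]
```

<p>Each attachment is itself a message part, so its bytes are available through <code>get_content()</code> and its name through <code>get_filename()</code>.</p>
</div>
</div>
</div>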
</div>
---
title: Importer
keywords: fastai
sidebar: home_sidebar
nb_path: "nbs/importers.Importer.ipynb"
---
<!--
#################################################
### THIS FILE WAS AUTOGENERATED! DO NOT EDIT! ###
#################################################
# file to edit: nbs/importers.Importer.ipynb
# command to build the docs after a change: nbdev_build_docs
-->
<div class="container" id="notebook-container">
{% raw %}
<div class="cell border-box-sizing code_cell rendered">
</div>
{% endraw %}
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Importers download your data from other sources. To retrieve that data, importers rely on downloaders.</p>
</div>
</div>
</div>
{% raw %}
<div class="cell border-box-sizing code_cell rendered">
</div>
{% endraw %}
{% raw %}
<div class="cell border-box-sizing code_cell rendered">
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_markdown rendered_html output_subarea ">
<h2 id="ImporterBase" class="doc_header"><code>class</code> <code>ImporterBase</code><a href="" class="source_link" style="float:right">[source]</a></h2><blockquote><p><code>ImporterBase</code>(<strong><code>importerClass</code></strong>=<em><code>None</code></em>, <strong>*<code>args</code></strong>, <strong>**<code>kwargs</code></strong>) :: <code>Importer</code></p>
</blockquote>
<p>Provides a base class for all items. All items in the schema inherit from this class, and it provides some
basic functionality for consistency and to enable easier usage.</p>
</div>
</div>
</div>
</div>
</div>
{% endraw %}
</div>
......@@ -6,8 +6,8 @@ title: Integrators
keywords: fastai
sidebar: home_sidebar
summary: "Integrators integrate your information in the pod. They import your data from external services (gmail, whatsapp, icloud, facebook etc.), enrich your data with indexers (face recognition, spam detection, duplicate photo detection), and execute actions (sending mails, automatically share selected photo's with your family)."
description: "Integrators integrate your information in the pod. They import your data from external services (gmail, whatsapp, icloud, facebook etc.), enrich your data with indexers (face recognition, spam detection, duplicate photo detection), and execute actions (sending mails, automatically share selected photo's with your family)."
summary: "Integrators integrate your information in your Pod. They import data from external services (Gmail, WhatsApp, etc.), enrich data with indexers (face recognition, spam detection, etc.), and execute actions (sending messages, uploading files, etc.)."
description: "Integrators integrate your information in your Pod. They import data from external services (Gmail, WhatsApp, etc.), enrich data with indexers (face recognition, spam detection, etc.), and execute actions (sending messages, uploading files, etc.)."
nb_path: "nbs/index.ipynb"
---
<!--
......@@ -29,6 +29,13 @@ nb_path: "nbs/index.ipynb"
</div>
{% endraw %}
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="Overview">Overview<a class="anchor-link" href="#Overview"> </a></h1><p>We start by listing the existing integrators and their functionality. Make sure to check out their pages for usage examples.</p>
</div>
</div>
</div>
{% raw %}
<div class="cell border-box-sizing code_cell rendered">
......@@ -51,17 +58,22 @@ nb_path: "nbs/index.ipynb"
<tr>
<td><a href="/integrators/indexers.FaceRecognitionIndexer.html#FaceRecognitionIndexer"><code>FaceRecognitionIndexer</code></a></td>
<td>Recognizes faces in photos.</td>
<td><img src="https://gitlab.memri.io/memri/pyintegrators/-/raw/face-indexer/assets/build-passing.svg" alt="alt text" title="Logo Title Text 1"></td>
<td><img src="https://gitlab.memri.io/memri/pyintegrators/-/raw/prod/assets/build-passing.svg" alt="Build passing" title="Build passing"></td>
</tr>
<tr>
<td><a href="/integrators/indexers.GeoIndexer.html#GeoIndexer"><code>GeoIndexer</code></a></td>
<td>Adds Countries and Cities to items with a location.</td>
<td><img src="https://gitlab.memri.io/memri/pyintegrators/-/raw/face-indexer/assets/build-passing.svg" alt="alt text" title="Logo Title Text 1"></td>
<td><img src="https://gitlab.memri.io/memri/pyintegrators/-/raw/prod/assets/build-passing.svg" alt="Build passing" title="Build passing"></td>
</tr>
<tr>
<td><a href="/integrators/importers.GmailImporter.html#GmailImporter"><code>GmailImporter</code></a></td>
<td>Imports email from Gmail.</td>
<td><img src="https://gitlab.memri.io/memri/pyintegrators/-/raw/prod/assets/build-passing.svg" alt="Build passing" title="Build passing"></td>
</tr>
<tr>
<td><a href="/integrators/indexers.NoteListIndexer.html#NotesListIndexer"><code>NotesListIndexer</code></a></td>
<td>Extracts lists from notes and categorizes them.</td>
<td><img src="https://gitlab.memri.io/memri/pyintegrators/-/raw/face-indexer/assets/build-passing.svg" alt="alt text" title="Logo Title Text 1"></td>
<td><img src="https://gitlab.memri.io/memri/pyintegrators/-/raw/prod/assets/build-passing.svg" alt="Build passing" title="Build passing"></td>
</tr>
</tbody>
</table>
......@@ -78,125 +90,63 @@ nb_path: "nbs/index.ipynb"
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Integrators for memri have a single repo per language, this repo is the one for python, but other repo's exist for <a href="https://gitlab.memri.io/memri/nodeintegrators">node</a> and we are planning to create one for rust. This repo is built with <a href="https://github.com/fastai/nbdev">nbdev</a> and therefore all code/documentation/tests are written in one place as jupyter notebooks and exported to a python-package/jekyll-website/unit-tests.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h2 id="Install">Install<a class="anchor-link" href="#Install"> </a></h2>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<pre><code>pip install -e integrators
nbdev_install_git_hooks</code></pre>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>This last command clears your notebooks of unnecessary metadata when making a commit.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h2 id="Build">Build<a class="anchor-link" href="#Build"> </a></h2>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>To enable calling integrators from the <a href="https://gitlab.memri.io/memri/pod">pod</a> the integrator docker containers needs to be built. <em>You can skip this if you are developing an indexer locally and you don't want to integrate with the pod yet.</em> To build, run:</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<pre><code>./examples/build.sh</code></pre>
<p>Integrators for Memri have a single repository per language. This repository is the one for Python; others exist for <a href="https://gitlab.memri.io/memri/nodeintegrators">Node.js</a> and <a href="https://gitlab.memri.io/memri/rustintegrators">Rust</a>. This repository makes use of <a href="https://github.com/fastai/nbdev">nbdev</a>, which means all code, documentation and tests are written in Jupyter Notebooks and exported to a Python package, a Jekyll documentation site and unit tests.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Now, the pod is able to find the integrator container when calling it.</p>
<h2 id="Using-Docker">Using Docker<a class="anchor-link" href="#Using-Docker"> </a></h2><p>Integrators are invoked by the Pod by launching a Docker container. To build the images for these containers, run:</p>
<div class="highlight"><pre><span></span>docker build -t memri-pyintegrators .
</pre></div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h2 id="How-to-develop-with-nbdev">How to develop with nbdev<a class="anchor-link" href="#How-to-develop-with-nbdev"> </a></h2>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The python integrators are written in nbdev. With nbdev, you use jupyter notebooks as a single source of truth, and generate the library, documentation and tests from the notebooks. The <a href="https://github.com/fastai/nbdev">nbdev website</a> contains great documentation that will help you understand how to develop with it. If you don't want to read that, the most important things to get you started are:</p>
<h2 id="Local-build">Local build<a class="anchor-link" href="#Local-build"> </a></h2><h3 id="Install">Install<a class="anchor-link" href="#Install"> </a></h3><p>To install the Python package:</p>
<div class="highlight"><pre><span></span>pip install -e .
</pre></div>
<p>If you want to contribute, clean the Jupyter Notebooks before every push to prevent conflicts
in the Notebooks' metadata. A Git hook that does this for you can be installed using:</p>
<div class="highlight"><pre><span></span>nbdev_install_git_hooks
</pre></div>
<h3 id="Jupyter-Notebooks">Jupyter Notebooks<a class="anchor-link" href="#Jupyter-Notebooks"> </a></h3><p>The Python integrators are written in nbdev. With nbdev, you write all code in
<a href="https://jupyter.readthedocs.io/en/latest/install/notebook-classic.html">Jupyter Notebooks</a>, and generate the library, documentation and tests using the nbdev CLI.</p>
<h3 id="nbdev">nbdev<a class="anchor-link" href="#nbdev"> </a></h3><p>With nbdev we write the code in Notebooks and specify the use of each cell with special tags. See the <a href="https://nbdev.fast.ai/">nbdev documentation</a> for all functionality and tutorials; the most important tags are listed below.</p>
<h4 id="nbdev-tags">nbdev tags<a class="anchor-link" href="#nbdev-tags"> </a></h4><ul>
<li>Notebooks whose names start with an underscore are ignored by nbdev completely.</li>
<li>Add <code>#default_exp &lt;packagename&gt;.&lt;modulename&gt;</code> to the top of your notebook to define the Python module to export to.</li>
<li>Add <code>#export</code> to the cells that define functions to include in the Python modules.</li>
<li>All cells without the <code>#export</code> tag are tests by default.</li>
<li>All cells are included in the documentation, unless you add the keyword <code>#hide</code>.</li>
</ul>
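<p>As a rough illustration of the tags listed above, the sketch below shows the contents of four notebook cells written out one after another. The module name <code>mypackage.mymodule</code> and the function <code>normalize_address</code> are hypothetical; in a real notebook each tag sits at the top of its own cell.</p>

```python
#default_exp mypackage.mymodule
# (the tag above would sit in the first cell of the notebook)

#export
def normalize_address(address):
    """Lower-case an email address and strip surrounding whitespace."""
    return address.strip().lower()

# A cell without the #export tag is treated as a test by default.
assert normalize_address("  A@Gmail.com ") == "a@gmail.com"

#hide
# A cell tagged with #hide is left out of the documentation.
scratch_value = normalize_address("B@gmail.com")
```

<p>Only the cell tagged with <code>#export</code> would end up in the generated Python module.</p>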
<h4 id="nbdev-CLI">nbdev CLI<a class="anchor-link" href="#nbdev-CLI"> </a></h4><p>After developing your code in Notebooks, you can use the nbdev CLI:</p>
<ul>
<li>Add <code>#export</code> flags to the cells that define the functions you want to include in your python modules.</li>
<li>Add <code>#default_exp &lt;packagename&gt;.&lt;modulename&gt;</code> to the top of your notebook to define the python module to export to.</li>
<li>All cell's that are not exported are tests by default</li>
<li><code>nbdev_build_lib</code> to convert the Notebooks to the library and tests </li>
<li><code>nbdev_test_nbs</code> to run the tests</li>
<li><code>nbdev_build_docs</code> to generate the docs</li>
<li><code>nbdev_clean_nbs</code> to clean the Notebooks' metadata to prevent Git conflicts</li>
</ul>
<p>When you are done writing your code in notebooks, call <code>nbdev_build_lib</code> to convert the notebooks to code and tests. Call <code>nbdev_build_docs</code> to generate the docs.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h3 id="Run-tests">Run tests<a class="anchor-link" href="#Run-tests"> </a></h3>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Every cell without the <code>#export</code> flag will be a test. So make sure that the code in notebooks runs fast and without errors. You can run all tests by calling.</p>
<pre><code>nbdev_test_nbs</code></pre>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h2 id="Docs">Docs<a class="anchor-link" href="#Docs"> </a></h2>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>If you want to hide certain functionality in the docs, you can use the <code>#hide</code> flag in the top of a cell</p>
<h3 id="Contributing">Contributing<a class="anchor-link" href="#Contributing"> </a></h3><p>Before you make a merge request, make sure that you have run all the nbdev commands listed above; otherwise GitLab's CI will not pass.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h3 id="Render-docs-locally">Render docs locally<a class="anchor-link" href="#Render-docs-locally"> </a></h3>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Often you might want to check your docs locally before deploying them. To do so, first install Jekyll:</p>
<pre><code>gem install bundler jekyll
bundle install</code></pre>
<p>Then, run the Jekyll server:</p>
<pre><code>cd docs
bundle exec jekyll serve</code></pre>
<p>And thats it!</p>
<h2 id="Docs">Docs<a class="anchor-link" href="#Docs"> </a></h2><p>Find the online docs at <a href="https://pyintegrators.docs.memri.io/">pyintegrators.docs.memri.io</a>.</p>
<h3 id="Render-documentation-locally">Render documentation locally<a class="anchor-link" href="#Render-documentation-locally"> </a></h3><p>New documentation will be deployed automatically when a new version is released to the <code>prod</code> branch. To inspect the documentation beforehand, you can run it on your local machine after <a href="https://jekyllrb.com/docs/installation/">installing Jekyll</a>.</p>
<p>To build the documentation:</p>
<div class="highlight"><pre><span></span><span class="nb">cd</span> docs
gem update --system
bundle install
</pre></div>
<p>To serve the documentation:</p>
<div class="highlight"><pre><span></span>bundle <span class="nb">exec</span> jekyll serve
</pre></div>
</div>
</div>
......
......@@ -61,7 +61,7 @@ nb_path: "nbs/indexers.FaceRecognitionIndexer.ipynb"
 
 
<div class="output_markdown rendered_html output_subarea ">
<h2 id="FaceRecognitionIndexer" class="doc_header"><code>class</code> <code>FaceRecognitionIndexer</code><a href="https://gitlab.memri.io/memri/integrators/tree/master/integrators/indexers/facerecognition/facerecognition_indexer.py#L49" class="source_link" style="float:right">[source]</a></h2><blockquote><p><code>FaceRecognitionIndexer</code>(<strong>*<code>args</code></strong>, <strong>**<code>kwargs</code></strong>) :: <a href="/integrators/indexers.indexer.html#IndexerBase"><code>IndexerBase</code></a></p>
<h2 id="FaceRecognitionIndexer" class="doc_header"><code>class</code> <code>FaceRecognitionIndexer</code><a href="https://gitlab.memri.io/memri/integrators/tree/master/integrators/indexers/facerecognition/facerecognition_indexer.py#L33" class="source_link" style="float:right">[source]</a></h2><blockquote><p><code>FaceRecognitionIndexer</code>(<strong>*<code>args</code></strong>, <strong>**<code>kwargs</code></strong>) :: <a href="/integrators/indexers.indexer.html#IndexerBase"><code>IndexerBase</code></a></p>
</blockquote>
<p>Recognizes faces in photos.</p>
 
......@@ -93,7 +93,7 @@ nb_path: "nbs/indexers.FaceRecognitionIndexer.ipynb"
 
 
<div class="output_markdown rendered_html output_subarea ">
<h2 id="IPhoto" class="doc_header"><code>class</code> <code>IPhoto</code><a href="https://gitlab.memri.io/memri/integrators/tree/master/integrators/indexers/facerecognition/facerecognition_indexer.py#L68" class="source_link" style="float:right">[source]</a></h2><blockquote><p><code>IPhoto</code>(<strong><code>file</code></strong>=<em><code>None</code></em>, <strong>*<code>args</code></strong>, <strong>**<code>kwargs</code></strong>) :: <code>Photo</code></p>
<h2 id="IPhoto" class="doc_header"><code>class</code> <code>IPhoto</code><a href="https://gitlab.memri.io/memri/integrators/tree/master/integrators/indexers/facerecognition/facerecognition_indexer.py#L52" class="source_link" style="float:right">[source]</a></h2><blockquote><p><code>IPhoto</code>(<strong><code>file</code></strong>=<em><code>None</code></em>, <strong>*<code>args</code></strong>, <strong>**<code>kwargs</code></strong>) :: <code>Photo</code></p>
</blockquote>
<p>Provides a base class for all items. All items in the schema inherit from this class, and it provides some
basic functionality for consistency and to enable easier usage.</p>
......@@ -126,7 +126,7 @@ basic functionality for consistency and to enable easier usage.</p>
 
 
<div class="output_markdown rendered_html output_subarea ">
<h4 id="show_images" class="doc_header"><code>show_images</code><a href="https://gitlab.memri.io/memri/integrators/tree/master/integrators/indexers/facerecognition/facerecognition_indexer.py#L33" class="source_link" style="float:right">[source]</a></h4><blockquote><p><code>show_images</code>(<strong><code>images</code></strong>, <strong><code>cols</code></strong>=<em><code>3</code></em>, <strong><code>titles</code></strong>=<em><code>None</code></em>)</p>
<h4 id="show_images" class="doc_header"><code>show_images</code><a href="https://gitlab.memri.io/memri/integrators/tree/master/integrators/indexers/facerecognition/facerecognition_indexer.py#L111" class="source_link" style="float:right">[source]</a></h4><blockquote><p><code>show_images</code>(<strong><code>images</code></strong>, <strong><code>cols</code></strong>=<em><code>3</code></em>, <strong><code>titles</code></strong>=<em><code>None</code></em>)</p>
</blockquote>
 
</div>
......@@ -270,31 +270,6 @@ when retrieving it from the database.</p>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_subarea output_stream output_stdout output_text">
<pre>1 items found to index
</pre>
</div>
</div>
<div class="output_area">
<div class="output_text output_subarea output_execute_result">
<pre>IndexerData
{&#39;items_with_location&#39;: [Address (#2)]}</pre>
</div>
</div>
</div>
</div>
</div>
{% endraw %}
......@@ -312,23 +287,6 @@ when retrieving it from the database.</p>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_subarea output_stream output_stdout output_text">
<pre>indexing 1 items
Loading formatted geocoded file...
creating IndexerRun (#4)
(&#39;Connection aborted.&#39;, RemoteDisconnected(&#39;Remote end closed connection without response&#39;))
</pre>
</div>
</div>
</div>
</div>
</div>
{% endraw %}
......@@ -346,21 +304,6 @@ creating IndexerRun (#4)
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_subarea output_stream output_stdout output_text">
<pre>creating Country (#None)
updating Address (#2)
</pre>
</div>
</div>
</div>
</div>
</div>
{% endraw %}
......@@ -405,7 +348,7 @@ updating Address (#2)
<div class="output_markdown rendered_html output_subarea ">
<h4 id="run_integrator" class="doc_header"><code>run_integrator</code><a href="https://gitlab.memri.io/memri/integrators/tree/master/integrators/indexers/indexer.py#L78" class="source_link" style="float:right">[source]</a></h4><blockquote><p><code>run_integrator</code>(<strong><code>environ</code></strong>=<em><code>None</code></em>, <strong><code>pod_full_address</code></strong>=<em><code>None</code></em>, <strong><code>integrator_run_uid</code></strong>=<em><code>None</code></em>, <strong><code>database_key</code></strong>=<em><code>None</code></em>, <strong><code>owner_key</code></strong>=<em><code>None</code></em>, <strong><code>verbose</code></strong>=<em><code>False</code></em>)</p>
<h4 id="run_integrator" class="doc_header"><code>run_integrator</code><a href="https://gitlab.memri.io/memri/integrators/tree/master/integrators/indexers/indexer.py#L89" class="source_link" style="float:right">[source]</a></h4><blockquote><p><code>run_integrator</code>(<strong><code>environ</code></strong>=<em><code>None</code></em>, <strong><code>pod_full_address</code></strong>=<em><code>None</code></em>, <strong><code>integrator_run_uid</code></strong>=<em><code>None</code></em>, <strong><code>database_key</code></strong>=<em><code>None</code></em>, <strong><code>owner_key</code></strong>=<em><code>None</code></em>, <strong><code>verbose</code></strong>=<em><code>False</code></em>)</p>
</blockquote>
<p>Runs an integrator, you can either provide the run settings as parameters to this function (for local testing)
or via environment variables (this is how the pod communicates with integrators).</p>
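<p>The pattern described above, explicit parameters for local testing with environment variables as the fallback, can be sketched as follows. This is not the code of <code>run_integrator</code>; the helper <code>resolve_run_settings</code> and the environment variable names are assumptions for illustration, since the names the Pod actually uses are not shown here.</p>

```python
import os

# Hypothetical environment variable names, one per run setting.
ENV_NAMES = {
    "pod_full_address": "POD_FULL_ADDRESS",
    "integrator_run_uid": "RUN_UID",
}


def resolve_run_settings(environ=None, **explicit):
    """Prefer explicitly passed settings, fall back to environment variables."""
    environ = os.environ if environ is None else environ
    settings = {}
    for name, env_name in ENV_NAMES.items():
        value = explicit.get(name)
        if value is None:
            value = environ.get(env_name)
        if value is None:
            raise ValueError("missing run setting: " + name)
        settings[name] = value
    return settings


# Local testing: everything is passed as parameters.
local = resolve_run_settings(
    environ={},
    pod_full_address="http://localhost:3030",
    integrator_run_uid="1",
)

# Container run: everything comes from the environment.
from_env = resolve_run_settings(
    environ={"POD_FULL_ADDRESS": "http://pod:3030", "RUN_UID": "14"}
)
```

<p>Passing the environment as a plain dictionary keeps the sketch testable without touching the real process environment.</p>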
......@@ -444,24 +387,6 @@ or via environment variables (this is how the pod communicates with integrators)
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_subarea output_stream output_stdout output_text">
<pre>1 items found to index
indexing 1 items
updating IndexerRun (#9)
creating Country (#None)
updating Address (#7)
</pre>
</div>
</div>
</div>
</div>
</div>
{% endraw %}
......@@ -495,22 +420,20 @@ updating Address (#7)
</div>
</div>
<div class="output_wrapper">
<div class="output">
</div>
{% endraw %}
<div class="output_area">
{% raw %}
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="output_subarea output_stream output_stdout output_text">
<pre>Reading run parameters from environment variables
1 items found to index
indexing 1 items
updating IndexerRun (#14)
creating Country (#None)
updating Address (#12)
</pre>
</div>
</div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">client</span><span class="o">.</span><span class="n">delete_all</span><span class="p">()</span>
</pre></div>
</div>
</div>
</div>
......
......@@ -61,7 +61,7 @@ nb_path: "nbs/pod.client.ipynb"
<div class="output_markdown rendered_html output_subarea ">
<h2 id="PodClient" class="doc_header"><code>class</code> <code>PodClient</code><a href="https://gitlab.memri.io/memri/integrators/tree/master/integrators/pod/client.py#L14" class="source_link" style="float:right">[source]</a></h2><blockquote><p><code>PodClient</code>(<strong><code>url</code></strong>=<em><code>'http://localhost:3030/v2'</code></em>, <strong><code>database_key</code></strong>=<em><code>None</code></em>, <strong><code>owner_key</code></strong>=<em><code>None</code></em>)</p>
<h2 id="PodClient" class="doc_header"><code>class</code> <code>PodClient</code><a href="https://gitlab.memri.io/memri/integrators/tree/master/integrators/pod/client.py#L15" class="source_link" style="float:right">[source]</a></h2><blockquote><p><code>PodClient</code>(<strong><code>url</code></strong>=<em><code>'http://localhost:3030'</code></em>, <strong><code>version</code></strong>=<em><code>'v2'</code></em>, <strong><code>database_key</code></strong>=<em><code>None</code></em>, <strong><code>owner_key</code></strong>=<em><code>None</code></em>)</p>
</blockquote>
</div>
......@@ -98,20 +98,6 @@ nb_path: "nbs/pod.client.ipynb"
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_subarea output_stream output_stdout output_text">
<pre>Succesfully connected to pod
</pre>
</div>
</div>
</div>
</div>
</div>
{% endraw %}
......@@ -143,22 +129,6 @@ nb_path: "nbs/pod.client.ipynb"
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_text output_subarea output_execute_result">
<pre>EmailMessage (#None)</pre>
</div>
</div>
</div>
</div>
</div>
{% endraw %}
......@@ -178,22 +148,6 @@ nb_path: "nbs/pod.client.ipynb"
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_text output_subarea output_execute_result">
<pre>EmailMessage (#1)</pre>
</div>
</div>
</div>
</div>
</div>
{% endraw %}
......@@ -225,22 +179,6 @@ nb_path: "nbs/pod.client.ipynb"
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_text output_subarea output_execute_result">
<pre>Person (#2) --author-&gt; EmailMessage (#1)</pre>
</div>
</div>
</div>
</div>
</div>
{% endraw %}
......@@ -276,22 +214,6 @@ nb_path: "nbs/pod.client.ipynb"
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_text output_subarea output_execute_result">
<pre>Person (#3)</pre>
</div>
</div>
</div>
</div>
</div>
{% endraw %}
......@@ -321,22 +243,6 @@ nb_path: "nbs/pod.client.ipynb"
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_text output_subarea output_execute_result">
<pre>Person (#3)</pre>
</div>
</div>
</div>
</div>
</div>
{% endraw %}
......@@ -367,22 +273,6 @@ nb_path: "nbs/pod.client.ipynb"
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_text output_subarea output_execute_result">
<pre>[Person (#2), Person (#3), Person (#4)]</pre>
</div>
</div>
</div>
</div>
</div>
{% endraw %}
......
......@@ -5,7 +5,8 @@
"ItemBase": "itembase.html"
},
"Importers": {
"Overview": "indexers.indexer.html"
"Overview": "importers.Importer.html",
"GmailImporter": "importers.GmailImporter.html"
},
"Indexers": {
"Overview": "indexers.indexer.html",
......
......@@ -8,14 +8,15 @@ index = {"read_file": "basic.ipynb",
"Path.ls": "basic.ipynb",
"PYI_HOME": "basic.ipynb",
"PYI_TESTDATA": "basic.ipynb",
"IMAPClient": "gmail.ipynb",
"get_message_content": "gmail.ipynb",
"get_addresses_from_message": "gmail.ipynb",
"get_timestamp_from_message": "gmail.ipynb",
"create_item_from_mail": "gmail.ipynb",
"download_mails": "gmail.ipynb",
"merge_duplicate_items": "gmail.ipynb",
"GmailImporter": "gmail.ipynb",
"IMAPClient": "importers.GmailImporter.ipynb",
"get_message_content": "importers.GmailImporter.ipynb",
"get_addresses_from_message": "importers.GmailImporter.ipynb",
"get_timestamp_from_message": "importers.GmailImporter.ipynb",
"create_item_from_mail": "importers.GmailImporter.ipynb",
"download_mails": "importers.GmailImporter.ipynb",
"merge_duplicate_items": "importers.GmailImporter.ipynb",
"GmailImporter": "importers.GmailImporter.ipynb",
"ImporterBase": "importers.Importer.ipynb",
"FaceRecognitionIndexer": "indexers.FaceRecognitionIndexer.ipynb",
"IPhoto": "indexers.FaceRecognitionIndexer.ipynb",
"show_images": "indexers.FaceRecognitionIndexer.ipynb",
......@@ -41,7 +42,6 @@ index = {"read_file": "basic.ipynb",
"contains": "indexers.NoteListIndexer.util.ipynb",
"HTML_LINEBREAK_REGEX": "indexers.NoteListIndexer.util.ipynb",
"IndexerBase": "indexers.indexer.ipynb",
"ImporterBase": "indexers.indexer.ipynb",
"IndexerData": "indexers.indexer.ipynb",
"get_indexer_run_data": "indexers.indexer.ipynb",
"test_registration": "indexers.indexer.ipynb",
......@@ -65,6 +65,7 @@ index = {"read_file": "basic.ipynb",
modules = ["data/basic.py",
"importers/gmail.py",
"importers/importer.py",
"indexers/facerecognition/facerecognition_indexer.py",
"indexers/geo/geo_indexer.py",
"indexers/notelist/notelist.py",
......
# AUTOGENERATED! DO NOT EDIT! File to edit: nbs/gmail.ipynb (unless otherwise specified).
# AUTOGENERATED! DO NOT EDIT! File to edit: nbs/importers.GmailImporter.ipynb (unless otherwise specified).
__all__ = ['IMAPClient', 'get_message_content', 'get_addresses_from_message', 'get_timestamp_from_message',
'create_item_from_mail', 'download_mails', 'merge_duplicate_items', 'GmailImporter']
# Cell
import imaplib, email
from ..data.schema import Account, EmailMessage, MessageChannel
from ..pod.client import PodClient
from email import policy
# Cell
class IMAPClient():
def __init__(self, username, app_pw, host='imap.gmail.com', port=993, inbox='"[Gmail]/All Mail"'):
......@@ -201,9 +199,11 @@ def merge_duplicate_items(all_mails):
from ..data.schema import *
from ..imports import *
from ..indexers.indexer import ImporterBase, test_registration
from ..indexers.indexer import test_registration
from .importer import ImporterBase
class GmailImporter(ImporterBase):
"""Imports email from GMail."""
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
......
# AUTOGENERATED! DO NOT EDIT! File to edit: nbs/importers.Importer.ipynb (unless otherwise specified).
__all__ = ['ImporterBase']
# Cell
from ..data.schema import *
class ImporterBase(Importer):
def __init__(self, importerClass=None, *args, **kwargs):
super().__init__(*args, **kwargs)
\ No newline at end of file
# AUTOGENERATED! DO NOT EDIT! File to edit: nbs/indexers.indexer.ipynb (unless otherwise specified).
__all__ = ['IndexerBase', 'ImporterBase', 'IndexerData', 'get_indexer_run_data', 'test_registration',
'POD_FULL_ADDRESS_ENV', 'RUN_UID_ENV', 'POD_SERVICE_PAYLOAD_ENV', 'DATABASE_KEY_ENV', 'OWNER_KEY_ENV',
'run_indexer', 'run_importer', 'run_integrator_from_run_uid', 'run_integrator']
__all__ = ['IndexerBase', 'IndexerData', 'get_indexer_run_data', 'test_registration', 'POD_FULL_ADDRESS_ENV',
'RUN_UID_ENV', 'POD_SERVICE_PAYLOAD_ENV', 'DATABASE_KEY_ENV', 'OWNER_KEY_ENV', 'run_indexer', 'run_importer',
'run_integrator_from_run_uid', 'run_integrator']
# Cell
from ..data.schema import *
......@@ -33,11 +33,6 @@ class IndexerBase(Indexer):
for item in updated_items:
item.update(api)
class ImporterBase(Importer):
def __init__(self, importerClass=None, *args, **kwargs):
super().__init__(*args, **kwargs)
class IndexerData():
def __init__(self, **kwargs):
for k, v in kwargs.items():
......
%% Cell type:code id: tags:
``` python
%load_ext autoreload
%autoreload 2
# default_exp importers.gmail
```
%% Cell type:code id: tags:
``` python
# export
import imaplib, email
from integrators.data.schema import Account, EmailMessage, MessageChannel
from integrators.pod.client import PodClient
from email import policy
```
%% Cell type:markdown id: tags:
# Gmail Importer
%% Cell type:code id: tags:
``` python
# export
class IMAPClient():
def __init__(self, username, app_pw, host='imap.gmail.com', port=993, inbox='"[Gmail]/All Mail"'):
# Quick fix to support Google's threading method
if host == 'imap.gmail.com':
self.x_gm_thrid_support = True
else:
self.x_gm_thrid_support = False
self.client = imaplib.IMAP4_SSL(host, port=port)
self.client.login(username, app_pw)
self.client.select(inbox) # connect to inbox.
def list_mailboxes(self):
return self.client.list()
def get_all_mail_uids(self):
result, data = self.client.uid('search', None, "ALL") # search and return uids instead
return data[0].split()
def get_mail(self, uid):
if self.x_gm_thrid_support:
# Use Google's threading method, in which every thread has an ID
result, data = self.client.uid('fetch', uid, '(RFC822 X-GM-THRID)')
thread_id = data[0][0].decode("utf-8").split(" ")[2]
raw_email = data[0][1]
return (raw_email, thread_id)
else:
# Threading not yet implemented for IMAP threading
result, data = self.client.uid('fetch', uid, '(RFC822)')
raw_email = data[0][1]
return (raw_email, None)
# def get_all_mails(self, uids):
# res = []
# for uid in tqdm(uids):
# result, data = self.client.uid('fetch', uid, '(RFC822)')
# raw_email = data[0][1]
# res.append(raw_email)
# return res
# def get_x_gm_thrid(self, uid):
# result, data = self.client.uid('fetch', uid, '(X-GM-THRID X-GM-MSGID)')
# return data[0].decode("utf-8").split(" ")[2]
# # @staticmethod
# def part_to_str(part):
# bytes_ = part.get_payload(decode=True)
# charset = part.get_content_charset('iso-8859-1')
# chars = bytes_.decode(charset, 'replace')
# return chars
# # @staticmethod
# def get_html(email_message_instance):
# maintype = email_message_instance.get_content_maintype()
# if maintype == 'multipart':
# parts = _get_all_parts(email_message_instance)
# res = None
# html_parts = [part_to_str(part) for part in parts if part.get_content_type() == "text/html"]
# if len(html_parts) > 0:
# if len(html_parts) > 1:
# error_msg = "\n AND \n".join(html_parts)
# print(f"WARNING: FOUND MULTIPLE HTML PARTS IN ONE MESSAGE {error_msg}")
# return html_parts[0]
# else:
# return parts[0].get_payload()
# elif maintype == 'text':
# return email_message_instance.get_payload()
# # @staticmethod
# def _get_all_parts(part):
# payload = part.get_payload()
# if isinstance(payload, list):
# return [x for p in payload for x in _get_all_parts(p)]
# else:
# return [part]
def get_message_content(message):
# content = get_html(message)
# # TODO: proper escaping here
# content = content.replace("=3D", "=")
attachments = []
# SEPARATE THE ATTACHMENTS
for i, x in enumerate(message.iter_attachments()):
attachments.append(x)
#f = open(f"tmp/gmail/{i}.png", 'wb')
#f.write(x.get_content())
#f.close()
content = message.get_body().get_content()
return (content, attachments)
def get_addresses_from_message(message, field):
if message[field] is None:
return []
else:
return email.utils.getaddresses([message[field]])
def get_timestamp_from_message(message):
date = message["date"]
parsed_time = email.utils.parsedate(date)
dt = email.utils.parsedate_to_datetime(date)
timestamp = int(dt.timestamp() * 1000)
return timestamp
```
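%% Cell type:markdown id: tags:
As a rough illustration of what `get_addresses_from_message` and `get_timestamp_from_message` compute, the sketch below uses only the standard library `email` package. The sample message, its addresses and the variable names are made up for this example and are not part of the importer.

```python
import email
import email.utils
from datetime import timezone
from email import policy

# Hypothetical sample message; addresses and date are illustrative only.
raw = (
    b"From: User One <a@example.com>\r\n"
    b"To: User Two <b@example.com>, c@example.com\r\n"
    b"Date: Mon, 04 May 2020 00:37:44 -0700\r\n"
    b"\r\n"
    b"body"
)
message = email.message_from_bytes(raw, policy=policy.SMTP)

# getaddresses returns (display name, address) tuples for a header value.
to_tuples = email.utils.getaddresses([str(message["to"])])
to_addresses = [address for _name, address in to_tuples]

# parsedate_to_datetime keeps the UTC offset, so the result is timezone-aware.
sent = email.utils.parsedate_to_datetime(str(message["date"]))
sent_utc = sent.astimezone(timezone.utc)

# Milliseconds since the Unix epoch, the unit used for dateSent above.
timestamp_ms = int(sent.timestamp() * 1000)
```

Because the parsed datetime carries its offset, the millisecond timestamp does not depend on the time zone of the machine running the importer.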
%% Cell type:code id: tags:
``` python
# export
def create_item_from_mail(mail, thread_id=None):
# message = email.message_from_string(mail_utf8)
message = email.message_from_bytes(mail, policy=policy.SMTP)
message_id = message["message-id"]
subject = message["subject"]
timestamp = get_timestamp_from_message(message)
from_tuples = get_addresses_from_message(message,'from')
to_tuples = get_addresses_from_message(message,'to')
reply_to_tuples = get_addresses_from_message(message,'reply-to')
# TODO: verbose option?
# print(f'{[address for name, address in from_tuples]} - {subject} [{thread_id}]')
(content, attachments) = get_message_content(message)
# TODO: is dateSent the right way to go? Might differ for whether you're sender or receiver
# TODO: importJson
# TODO: MAIL namespace
email_item = EmailMessage(externalId=message_id, subject=subject, dateSent=timestamp, content=content)
# Create Edges to accounts
for name, address in from_tuples:
address_item = Account(externalId=address)
email_item.add_edge('sender', address_item)
for name, address in to_tuples:
address_item = Account(externalId=address)
email_item.add_edge('receiver', address_item)
for name, address in reply_to_tuples:
address_item = Account(externalId=address)
email_item.add_edge('replyTo', address_item)
# Create edge to MessageChannel
if thread_id is not None:
message_channel = MessageChannel(externalId=thread_id)
email_item.add_edge('messageChannel', message_channel)
return email_item
def download_mails(imap_client, gmail_ids, stop_at):
all_mails = []
# Download files
for i, gmail_id in enumerate(gmail_ids):
if stop_at is not None and i >= stop_at:
print(f"stopped early at {stop_at}")
break
mail, thread_id = imap_client.get_mail(gmail_id)
# thread_id = imap_client.get_x_gm_thrid(gmail_id)
item = create_item_from_mail(mail, thread_id=thread_id)
all_mails.append(item)
return all_mails
# TODO: should probably become a general utility function
def merge_duplicate_items(all_mails):
all_accounts = {}
for email_item in all_mails:
for edge in email_item.get_all_edges():
account = edge.traverse(email_item)
if account.externalId not in all_accounts:
all_accounts[account.externalId] = account
for email_item in all_mails:
for edge in email_item.get_all_edges():
edge.target = all_accounts[edge.target.externalId]
return all_accounts
```
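%% Cell type:markdown id: tags:
The idea behind `merge_duplicate_items` can be sketched without the Pod schema: keep one canonical object per `externalId` and re-point every edge at it. The `Account`, `Edge` and `Mail` dataclasses and the `merge_by_external_id` function below are hypothetical stand-ins written for this illustration, and the sketch uses a single pass with `dict.setdefault` rather than the two loops in the cell above.

```python
from dataclasses import dataclass, field

# Hypothetical stand-ins; the real classes in integrators.data.schema differ.
@dataclass
class Account:
    externalId: str

@dataclass
class Edge:
    name: str
    target: Account

@dataclass
class Mail:
    edges: list = field(default_factory=list)

def merge_by_external_id(mails):
    """Point every edge at one shared target object per externalId."""
    canonical = {}
    for mail in mails:
        for edge in mail.edges:
            # setdefault keeps the first object seen for an externalId.
            edge.target = canonical.setdefault(edge.target.externalId, edge.target)
    return canonical

mails = [
    Mail([Edge("sender", Account("a@example.com"))]),
    Mail([Edge("receiver", Account("a@example.com")),
          Edge("sender", Account("b@example.com"))]),
]
accounts = merge_by_external_id(mails)
```

After the call, both mails reference the same `Account` object for `a@example.com`, so that account only needs to be created once.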
%% Cell type:code id: tags:
``` python
# export
from integrators.data.schema import *
from integrators.imports import *
from integrators.indexers.indexer import ImporterBase, test_registration
from integrators.indexers.indexer import test_registration
from integrators.importers.importer import ImporterBase
class GmailImporter(ImporterBase):
"""Imports email from GMail."""
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
def get_data(self, client, indexer_run):
print('this function is a workaround (this Importer is an Indexer temporarily)')
def run(self, importer_run, pod_client=None):
# TODO: Get imap_host from importer_run
imap_host = 'imap.gmail.com'
imap_client = IMAPClient(username=importer_run.username,
app_pw=importer_run.password,
host=imap_host,
port=993)
gmail_ids = imap_client.get_all_mail_uids()
all_mails = download_mails(imap_client, gmail_ids, 10)
# Merge email accounts/messageChannels here
# TODO: create better way to do this
all_accounts = merge_duplicate_items(all_mails)
# Create all email and account items
all_thread_ids = set()
for email_item in all_mails:
pod_client.create(email_item)
for message_channel in email_item.messageChannel:
all_thread_ids.add(message_channel.externalId)
for (external_id, item) in all_accounts.items():
pod_client.create(item)
# Create all edges from emails to accounts/messageThreads
for email_item in all_mails:
pod_client.create_edges(email_item.get_all_edges())
```
%% Cell type:code id: tags:
``` python
%nbdev_slow_test
# Store your credentials in this file:
# The first line holds the username, the second line the app password.
with open('tmp/credentials_gmail.txt', 'r') as file:
    imap_user = file.readline().strip('\n')
    imap_pw = file.readline().strip('\n')
imap_host = 'imap.gmail.com'
pod_client = PodClient()
pod_client.delete_all()
importer_run = ImporterRun.from_data(progress=0, username=imap_user, password=imap_pw)
importer = GmailImporter.from_data()
importer.run(importer_run=importer_run, pod_client=pod_client)
```
%% Output
UsageError: Line magic function `%nbdev_slow_test` not found.
%% Cell type:code id: tags:
``` python
test = b"""\
Message-id: 1234\r
From: user1 <a@gmail.com>\r
To: user1 <b@gmail.com>\r
Reply-to: user1 <c@gmail.com>\r
Subject: the subject\r
Date: Mon, 04 May 2020 00:37:44 -0700\r
This is content"""
#mail_message = email.message_from_string(test)
mail_item = create_item_from_mail(test, 'message_channel_id')
assert mail_item.externalId == '1234'
assert mail_item.sender[0].externalId == 'a@gmail.com'
assert mail_item.receiver[0].externalId == 'b@gmail.com'
assert mail_item.replyTo[0].externalId == 'c@gmail.com'
assert mail_item.subject == 'the subject'
assert mail_item.content == 'This is content'
assert mail_item.dateSent == get_timestamp_from_message(email.message_from_bytes(test))
assert mail_item.messageChannel[0].externalId == 'message_channel_id'
```
%% Cell type:code id: tags:
``` python
# Test attachment parsing (basic support)
message = email.message.EmailMessage()
message.set_content('aa')
message.add_attachment(b'bb', maintype='image', subtype='jpeg', filename='sample.jpg')
message.add_attachment(b'cc', maintype='image', subtype='jpeg', filename='sample2.jpg')
content, attachments = get_message_content(message)
assert content == 'aa\n'
assert attachments[0].get_content() == b'bb'
assert attachments[1].get_content() == b'cc'
```
......
%% Cell type:code id: tags:
``` python
%load_ext autoreload
%autoreload 2
# default_exp importers.importer
```
%% Cell type:markdown id: tags:
# Importer
%% Cell type:markdown id: tags:
Importers download your data from external sources. To retrieve that data, importers rely on downloaders.
%% Cell type:code id: tags:
``` python
# export
from integrators.data.schema import *
class ImporterBase(Importer):
def __init__(self, importerClass=None, *args, **kwargs):
super().__init__(*args, **kwargs)
```
%% Cell type:code id: tags:
``` python
# hide
from integrators import *
```
%% Cell type:markdown id: tags:
# Integrators
# pyintegrators
> Integrators integrate your information in your Pod. They import data from external services (Gmail, WhatsApp, etc.), enrich data with indexers (face recognition, spam detection, etc.), and execute actions (sending messages, uploading files, etc.).
%% Cell type:markdown id: tags:
> Integrators integrate your information in the pod. They import your data from external services (gmail, whatsapp, icloud, facebook etc.), enrich your data with indexers (face recognition, spam detection, duplicate photo detection), and execute actions (sending mails, automatically share selected photo's with your family).
# Overview
We start by listing the existing indexers and their functionalities, make sure to check out their pages for usage examples.
%% Cell type:code id: tags:
``` python
# hide
from IPython.display import Markdown as md
import integrators.integrator_registry
from integrators.data.basic import *
from integrators.imports import *
from nbdev.cli import _test_one
def get_notebook_from_cls(cls):
path = inspect.getfile(cls)
f_content = read_file(path)
file = re.search("(?<=File to edit: )[^\s]*", f_content).group()
return file[4:] # remove 'nbs/'
txt_passing = '![alt text](https://gitlab.memri.io/memri/pyintegrators/-/raw/face-indexer/assets/build-passing.svg "Logo Title Text 1")'
txt_failing = '![alt text](https://gitlab.memri.io/memri/pyintegrators/-/raw/face-indexer/assets/build-failing.svg "Logo Title Text 1")'
txt_passing = '![Build passing](https://gitlab.memri.io/memri/pyintegrators/-/raw/prod/assets/build-passing.svg "Build passing")'
txt_failing = '![Build failing](https://gitlab.memri.io/memri/pyintegrators/-/raw/prod/assets/build-failing.svg "Build failing")'
table = f"""
| Integrator | Description | Tests passing |
|------------|-------------|---------------|
"""
for m in dir(integrators.integrator_registry):
if "__" not in m:
cls = getattr(integrators.integrator_registry, m)
nb = get_notebook_from_cls(cls)
test_succeeded = _test_one(nb, verbose=False)[0]
build_txt = txt_passing if test_succeeded else txt_failing
table += f"|`{cls.__name__}`|{cls.__doc__ if cls.__doc__ is not None else ''}| {build_txt}|\n"
```
%% Output
testing indexers.FaceRecognitionIndexer.ipynb
testing indexers.GeoIndexer.ipynb
testing importers.GmailImporter.ipynb
testing indexers.NoteListIndexer.ipynb
%% Cell type:code id: tags:
``` python
# hide_input
md(table)
```
%% Output
| Integrator | Description | Tests passing |
|------------|-------------|---------------|
|`FaceRecognitionIndexer`|Recognizes photos from faces.| ![alt text](https://gitlab.memri.io/memri/pyintegrators/-/raw/face-indexer/assets/build-passing.svg "Logo Title Text 1")|
|`GeoIndexer`|Adds Countries and Cities to items with a location.| ![alt text](https://gitlab.memri.io/memri/pyintegrators/-/raw/face-indexer/assets/build-passing.svg "Logo Title Text 1")|
|`NotesListIndexer`|Extracts lists from notes and categorizes them.| ![alt text](https://gitlab.memri.io/memri/pyintegrators/-/raw/face-indexer/assets/build-passing.svg "Logo Title Text 1")|
|`FaceRecognitionIndexer`|Recognizes photos from faces.| ![Build passing](https://gitlab.memri.io/memri/pyintegrators/-/raw/prod/assets/build-passing.svg "Build passing")|
|`GeoIndexer`|Adds Countries and Cities to items with a location.| ![Build passing](https://gitlab.memri.io/memri/pyintegrators/-/raw/prod/assets/build-passing.svg "Build passing")|
|`GmailImporter`|Imports email from GMail.| ![Build passing](https://gitlab.memri.io/memri/pyintegrators/-/raw/prod/assets/build-passing.svg "Build passing")|
|`NotesListIndexer`|Extracts lists from notes and categorizes them.| ![Build passing](https://gitlab.memri.io/memri/pyintegrators/-/raw/prod/assets/build-passing.svg "Build passing")|
<IPython.core.display.Markdown object>
%% Cell type:markdown id: tags:
Integrators for memri have a single repo per language, this repo is the one for python, but other repo's exist for [node](https://gitlab.memri.io/memri/nodeintegrators) and we are planning to create one for rust. This repo is built with [nbdev](https://github.com/fastai/nbdev) and therefore all code/documentation/tests are written in one place as jupyter notebooks and exported to a python-package/jekyll-website/unit-tests.
%% Cell type:markdown id: tags:
## Install
Integrators for Memri have a single repository per language. This repository is the one for Python; others exist for [Node.js](https://gitlab.memri.io/memri/nodeintegrators) and [Rust](https://gitlab.memri.io/memri/rustintegrators). This repository uses [nbdev](https://github.com/fastai/nbdev), which means all code, documentation and tests are written in Jupyter Notebooks and exported to a Python package, a Jekyll documentation site and unit tests.
%% Cell type:markdown id: tags:
## Using Docker
Integrators are invoked by the Pod by launching a Docker container. To build the images for these containers, run:
```bash
docker build -t memri-pyintegrators .
```
pip install -e integrators
nbdev_install_git_hooks
```
%% Cell type:markdown id: tags:
This last command clears your notebooks of unnecessary metadata when making a commit.
%% Cell type:markdown id: tags:
## Build
%% Cell type:markdown id: tags:
To enable calling integrators from the [pod](https://gitlab.memri.io/memri/pod), the integrator Docker containers need to be built. *You can skip this if you are developing an indexer locally and you don't want to integrate with the pod yet.* To build, run:
%% Cell type:markdown id: tags:
## Local build
### Install
To install the Python package:
```bash
pip install -e .
```
./examples/build.sh
```
%% Cell type:markdown id: tags:
Now, the pod is able to find the integrator container when calling it.
%% Cell type:markdown id: tags:
## How to develop with nbdev
%% Cell type:markdown id: tags:
The python integrators are written in nbdev. With nbdev, you use jupyter notebooks as a single source of truth, and generate the library, documentation and tests from the notebooks. The [nbdev website](https://github.com/fastai/nbdev) contains great documentation that will help you understand how to develop with it. If you don't want to read that, the most important things to get you started are:
- Add `#export` flags to the cells that define the functions you want to include in your python modules.
- Add `#default_exp <packagename>.<modulename>` to the top of your notebook to define the python module to export to.
- All cells that are not exported are tests by default.
When you are done writing your code in notebooks, call `nbdev_build_lib` to convert the notebooks to code and tests. Call `nbdev_build_docs` to generate the docs.
%% Cell type:markdown id: tags:
### Run tests
%% Cell type:markdown id: tags:
Every cell without the `#export` flag will be a test, so make sure that the code in notebooks runs fast and without errors. You can run all tests by calling:
```
nbdev_test_nbs
If you want to contribute, you have to clean the Jupyter Notebooks every time before you push code to prevent conflicts
in the Notebooks' metadata. A script to do so can be installed using:
```bash
nbdev_install_git_hooks
```
%% Cell type:markdown id: tags:
### Jupyter Notebooks
The Python integrators are written in nbdev. With nbdev, you write all code in
[Jupyter Notebooks](https://jupyter.readthedocs.io/en/latest/install/notebook-classic.html), and generate the library, documentation and tests using the nbdev CLI.
### nbdev
With nbdev we create the code in Notebooks, where we specify the use of cells using special tags. See the [nbdev documentation](https://nbdev.fast.ai/) for all functionalities and tutorials; the most important tags are listed below.
#### nbdev tags
- Notebooks whose names start with an underscore are ignored by nbdev completely
- Add `#default_exp <packagename>.<modulename>` to the top of your notebook to define the Python module to export to
- Add `#export` to the cells that define functions to include in the Python modules.
- All cells without the `#export` tag are tests by default
- All cells are included in the documentation, unless you add the keyword `#hide`
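
As a small illustration of the tags above, a notebook might contain cells like the following, shown here as a single Python listing. The module name `importers.example` and the function `normalize_address` are hypothetical and only serve this sketch; the text in parentheses is explanation, not nbdev syntax.

```python
# default_exp importers.example
# (first cell: names the module that exported cells are written to)

# export
# (exported cell: ends up in the generated Python module)
def normalize_address(address):
    """Lowercase an email address and strip surrounding whitespace."""
    return address.strip().lower()

# (cell without the export tag: runs as a test)
assert normalize_address("  User@Example.COM ") == "user@example.com"

# hide
# (hidden cell: left out of the generated documentation)
```

Since the tags are ordinary comments, the cells still run as plain Python when executed in Jupyter.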
#### nbdev CLI
After developing your code in Notebooks, you can use the nbdev CLI:
- `nbdev_build_lib` to convert the Notebooks to the library and tests
- `nbdev_test_nbs` to run the tests
- `nbdev_build_docs` to generate the docs
- `nbdev_clean_nbs` to clean the Notebooks' metadata to prevent Git conflicts
## Docs
### Contributing
Before you make a merge request, make sure that you used all the nbdev commands specified above, or GitLab's CI won't pass.
%% Cell type:markdown id: tags:
If you want to hide certain functionality in the docs, you can use the `#hide` flag at the top of a cell.
%% Cell type:markdown id: tags:
### Render docs locally
%% Cell type:markdown id: tags:
## Docs
Find the online docs at [pyintegrators.docs.memri.io](https://pyintegrators.docs.memri.io/).
Often you might want to check your docs locally before deploying them. To do so, first install Jekyll:
### Render documentation locally
New documentation will be deployed automatically when a new version is released to the `prod` branch. To inspect the documentation beforehand, you can run it on your local machine after [installing Jekyll](https://jekyllrb.com/docs/installation/).
```
gem install bundler jekyll
To build the documentation:
```bash
cd docs
gem update --system
bundle install
```
Then, run the Jekyll server:
```
cd docs
To serve the documentation:
```bash
bundle exec jekyll serve
```
And that's it!
......