How the archive was extracted from Evernote and converted into a browsable HTML site.
The entire conversion runs locally. No APIs, no cloud services, no network requests. The output is static HTML with relative paths—no server needed to browse it.
Evernote's desktop app (v10.134.4) provides a native export feature. Selecting notebooks and choosing Export as .enex produces XML files that contain every note with its full content and resources.
The export created two artifacts:
| File | Format | Contents |
|---|---|---|
| `export/*.enex` | XML (Evernote Export) | Notes, ENML content, base64-encoded resources, metadata |
| `en_backup.db` | SQLite 3 | Evernote's internal database (notes, notebooks, tags, etc.) |
An .enex file is XML with this structure:
```xml
<?xml version="1.0" encoding="UTF-8"?>
<en-export application="Evernote" version="10.134.4">
  <note>
    <title>Note Title</title>
    <created>20120803T061625Z</created>
    <updated>20120803T061637Z</updated>
    <note-attributes>
      <source>web.clip</source>
      <source-url>https://example.com/article</source-url>
      <latitude>37.882</latitude>
      <longitude>-122.287</longitude>
    </note-attributes>
    <content><![CDATA[
      <?xml version="1.0" encoding="UTF-8"?>
      <en-note>...ENML content...</en-note>
    ]]></content>
    <resource>
      <data encoding="base64">/9j/4AAQ...</data>
      <mime>image/jpeg</mime>
      <width>462</width>
      <height>308</height>
      <resource-attributes>
        <file-name>photo.jpg</file-name>
      </resource-attributes>
    </resource>
  </note>
  <!-- ...more notes... -->
</en-export>
```
Key points:
- `<resource>` has no explicit hash attribute; the hash must be computed from the decoded bytes.
- ENML content references resources by hash via `<en-media hash="...">` tags.

Evernote also exports its internal SQLite database. This is a separate artifact from the .enex files and has not been used in the current conversion. It likely contains a relational schema with tables for notes, notebooks, tags, and their relationships, potentially useful for a future Django migration.
The script uses Python's built-in xml.etree.ElementTree.iterparse to stream-parse each .enex file. This is critical because several files are enormous:
| File | Size |
|---|---|
| `PB1099.enex` | 1.7 GB (467 notes) |
| `All Instagram Posts.enex` | 669 MB (67 notes) |
| `All Photography.enex` | 505 MB (249 notes) |
Loading these into a DOM tree would consume many gigabytes of RAM. Instead, iterparse fires events as elements are closed, and elem.clear() frees each note after processing. Think of it as the XML equivalent of Django's StreamingHttpResponse—constant memory regardless of file size.
```python
import xml.etree.ElementTree as ET

context = ET.iterparse(str(enex_path), events=('end',))
for event, elem in context:
    if elem.tag != 'note':
        continue
    # process note...
    elem.clear()  # free memory
```
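Inside that loop, the note's fields can be read with standard ElementTree accessors. A minimal sketch of that step (the helper name `note_fields` is an illustration, not a function from convert.py):

```python
import xml.etree.ElementTree as ET

def note_fields(elem: ET.Element) -> dict:
    """Extract the basic fields from a parsed <note> element."""
    attrs = elem.find('note-attributes')
    return {
        'title': elem.findtext('title', default='Untitled'),
        'created': elem.findtext('created'),
        'source_url': attrs.findtext('source-url') if attrs is not None else None,
        'content': elem.findtext('content'),
    }
```

`findtext` returns `None` for missing optional fields, which matches the .enex structure shown earlier, where `<note-attributes>` and its children are not always present.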
Each <resource> element's base64 data is decoded and written to a _resources/ subfolder next to the note. The MD5 hash of the decoded bytes is computed:
```python
import base64
import hashlib

raw = base64.b64decode(data_elem.text)
md5_hash = hashlib.md5(raw).hexdigest()
```
This hash is used to build a lookup table: hash → (filepath, mime_type). When the ENML content references <en-media hash="abc123" type="image/jpeg">, the converter looks up abc123 in the table and replaces the tag with <img src="Note_resources/photo.jpg">.
This is essentially a content-addressable storage pattern—the same approach Git uses for object storage, where the hash of the content serves as the key.
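The lookup-and-replace step can be sketched as follows (function names and the regex are illustrative assumptions, not the exact code in convert.py):

```python
import base64
import hashlib
import re

# hash -> (relative path, mime type), built while extracting resources
resources: dict[str, tuple[str, str]] = {}

def register_resource(data_b64: str, path: str, mime: str) -> None:
    """Decode a resource and index it by the MD5 of its raw bytes."""
    raw = base64.b64decode(data_b64)
    resources[hashlib.md5(raw).hexdigest()] = (path, mime)

def replace_en_media(enml: str) -> str:
    """Swap <en-media> tags for <img> tags using the hash lookup table."""
    def sub(match: re.Match) -> str:
        entry = resources.get(match.group(1))
        if entry and entry[1].startswith('image/'):
            return '<img src="%s">' % entry[0]
        return ''  # unknown hash or unhandled type
    return re.sub(r'<en-media hash="([0-9a-f]+)"[^>]*/?>', sub, enml)
```

The dictionary is the content-addressable store in miniature: the key is derived from the bytes, so duplicate resources collapse to one entry automatically.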
ENML is a restricted subset of XHTML with custom Evernote-specific tags. The converter uses regex replacements to transform these:
| ENML Tag | HTML Output |
|---|---|
| `<en-note>` | `<div class="en-note">` |
| `<en-media hash="..." type="image/*">` | `<img src="Note_resources/file.jpg">` |
| `<en-media hash="..." type="audio/*">` | `<audio controls src="Note_resources/file.m4a">` |
| `<en-media hash="..." type="application/pdf">` | `<a href="Note_resources/file.pdf">` |
| `<en-todo checked="true"/>` | `<input type="checkbox" checked disabled>` |
| `<en-todo checked="false"/>` | `<input type="checkbox" disabled>` |
| `<en-crypt>` | `[encrypted content]` |
| All other HTML | Passed through unchanged |
The "passed through unchanged" part is important—web clippings saved in Evernote contain their original HTML with inline styles, and the converter preserves all of that structure.
File paths in src and href attributes are percent-encoded with urllib.parse.quote(). This was a bug fix: the initial version used html.escape(), which converted apostrophes to `&#x27;` and broke paths for notes like "Egypt's corrupt decades." Browsers expect percent-encoding in URLs, not HTML entities.
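The difference is easy to demonstrate (a standalone illustration, not code from convert.py):

```python
import html
from urllib.parse import quote

path = "Egypt's corrupt decades_resources/photo.jpg"

# quote() percent-encodes; browsers resolve this correctly in a URL
print(quote(path))        # Egypt%27s%20corrupt%20decades_resources/photo.jpg

# html.escape() emits an HTML entity, which is NOT decoded inside a URL path
print(html.escape(path))  # Egypt&#x27;s corrupt decades_resources/photo.jpg
```

Note that `quote()` leaves `/` unescaped by default, so relative paths with subfolders survive intact.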
The archive contains 443 audio files (mostly .m4a voice dictations, some .wav). These are transcribed locally using OpenAI Whisper, an open-source speech recognition model.
Two details matter here:

- The model used is `base` (74M parameters, ~150 MB on disk). It runs entirely on the local CPU: no API calls, no data leaves the machine.
- Whisper uses ffmpeg under the hood to decode .m4a and .wav files into raw audio.

For each audio file, two things happen:

1. A .txt file is saved alongside the audio (e.g., audio.m4a → audio.txt).
2. A `<details>` block is injected into the parent HTML file below the `<audio>` tag:

```html
<audio controls src="Note_resources/audio.m4a"></audio>
<details class="transcription">
  <summary>Transcription</summary>
  <p>The transcribed text appears here...</p>
</details>
```
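The injection itself amounts to a targeted string substitution. A hypothetical reconstruction (the helper name and regex are assumptions; convert.py may differ):

```python
import re

def inject_transcription(page_html: str, audio_src: str, transcript: str) -> str:
    """Insert a <details> block immediately after the matching <audio> tag."""
    audio_tag = re.compile(
        r'(<audio[^>]*src="' + re.escape(audio_src) + r'"[^>]*></audio>)')
    details = (
        '\\1\n<details class="transcription">\n'
        '  <summary>Transcription</summary>\n'
        '  <p>' + transcript + '</p>\n'
        '</details>'
    )
    return audio_tag.sub(details, page_html, count=1)
```

Using `count=1` keeps a page with several audio tags from receiving the same transcription twice.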
| Metric | Value |
|---|---|
| Audio files processed | 443 |
| Successfully transcribed | 437 (6 were empty/silent) |
| Total words transcribed | 269,295 |
| Runtime (CPU, base model) | ~57 minutes |
Whisper offers larger models (small, medium, large) with better accuracy, especially for accented speech or noisy recordings. To re-transcribe, change the model name in convert.py and re-run Phase 2. Trade-off: the large model is ~10x slower than base on CPU.
The final phase walks the notes/ directory, discovers the notebook/stack structure, and generates the top-level index.html.
| Dependency | Phase | Purpose | Install |
|---|---|---|---|
| Python 3 | All | Runtime | (system) |
| `xml.etree.ElementTree` | 1 | XML streaming parser | stdlib |
| `hashlib` | 1 | MD5 hash for resource matching | stdlib |
| `base64` | 1 | Decode resource data | stdlib |
| `urllib.parse` | 1, 3 | Percent-encode file paths | stdlib |
| `openai-whisper` | 2 | Speech-to-text transcription | `pip install openai-whisper` |
| `ffmpeg` | 2 | Audio decoding (used by Whisper) | `brew install ffmpeg` |
Phases 1 and 3 use only the Python standard library—no pip packages required.
The static HTML archive preserves the data. The Django application will make it usable—searchable, self-organizing, and extensible for future work.
Two options for seeding the database:
1. `en_backup.db`: Evernote's SQLite database likely already has a relational schema with notes, notebooks, and tags. It can be explored with `sqlite3 en_backup.db ".tables"` and `.schema`.
2. Re-parse the .enex files: the parsing logic in convert.py can be adapted directly.

A sketch of the Django models:

```python
from django.db import models

class Tag(models.Model):
    """Tags on notes. Notebook names are auto-imported as tags."""
    name = models.CharField(max_length=255, unique=True)
    is_notebook = models.BooleanField(default=False)
    # is_notebook=True means this tag was generated from an
    # Evernote notebook name. It's metadata, not organizational.

class Category(models.Model):
    """Emergent categories generated from weighted content analysis.
    These are NOT notebook names. They are algorithmically derived."""
    name = models.CharField(max_length=255, unique=True)
    weight = models.FloatField(default=0.0, db_index=True)
    # Higher weight = more prominent on the index grid

class Note(models.Model):
    title = models.CharField(max_length=500)
    created = models.DateTimeField()
    updated = models.DateTimeField(null=True)
    last_accessed = models.DateTimeField(auto_now=True, db_index=True)
    # ^ Noguchi: updated on every view, drives sort order
    content_html = models.TextField()
    source = models.CharField(max_length=100, blank=True)
    source_url = models.URLField(max_length=2000, blank=True)
    latitude = models.FloatField(null=True, blank=True)
    longitude = models.FloatField(null=True, blank=True)
    has_audio = models.BooleanField(default=False)
    tags = models.ManyToManyField(Tag, blank=True, related_name='notes')
    categories = models.ManyToManyField(Category, blank=True,
                                        related_name='notes')

class Resource(models.Model):
    note = models.ForeignKey(Note, on_delete=models.CASCADE,
                             related_name='resources')
    md5_hash = models.CharField(max_length=32, db_index=True)
    mime_type = models.CharField(max_length=100)
    original_filename = models.CharField(max_length=255, blank=True)
    file = models.FileField(upload_to='resources/')

class Transcription(models.Model):
    resource = models.OneToOneField(Resource, on_delete=models.CASCADE)
    text = models.TextField()
    word_count = models.IntegerField()
    model_used = models.CharField(max_length=50, default='base')

class StopTerm(models.Model):
    """Terms excluded from tagging and category generation."""
    term = models.CharField(max_length=200, unique=True)
    reason = models.CharField(max_length=200, blank=True)
    # e.g. "auto-generated", "too generic", "noise"
```
Categories are generated algorithmically, not assigned manually. The process:
1. Seed the StopTerm table with generic titles (e.g. `Untitled`, `Note`, `Miscellaneous`), editable over time without touching code.
2. Filter out noise titles by pattern: `Note` followed by optional numbers, Evernote timestamp patterns (`Evernote \d{8}`), strings that are just URLs.

The StopTerm table doubles as a log of why something was excluded, useful when tuning the system later.
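The noise patterns described above might look like this (a sketch; the actual regexes may differ):

```python
import re

# Titles matching any of these patterns are treated as noise.
NOISE_PATTERNS = [
    re.compile(r"^Note ?\d*$"),        # "Note", "Note 3", "Note 12"
    re.compile(r"^Evernote \d{8}$"),   # Evernote timestamp-style titles
    re.compile(r"^https?://\S+$"),     # titles that are just URLs
]

def is_noise(title: str) -> bool:
    return any(p.match(title) for p in NOISE_PATTERNS)
```

Keeping the patterns in a list makes it easy to migrate them into StopTerm rows later, with the pattern itself stored in `term` and the rationale in `reason`.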
The homepage renders as a grid of category cells, sorted by category weight (heaviest categories most prominent). Each cell contains:
- Notes sorted by `last_accessed` descending (Noguchi)

Every time a user views a note, last_accessed updates and the note floats to the top of its category on the index. Notes that haven't been accessed in months sink to the bottom and eventually off the visible list.
Full-text search will use Postgres `SearchVector` / `SearchRank` across note content and transcriptions.