Tech Stack

How the archive was extracted from Evernote and converted into a browsable HTML site.


Pipeline Overview

Evernote Desktop App
        |
        | native export (File → Export Notebooks)
        v
112 .enex files (4.7 GB XML + base64 resources)
en_backup.db (SQLite)
        |
        | convert.py (Phase 1: iterparse + regex)
        v
4,344 HTML files + 16,543 extracted resources
        |
        | convert.py (Phase 2: openai-whisper, local)
        v
437 audio transcriptions (.txt + injected into HTML)
        |
        | convert.py (Phase 3)
        v
index.html, guide.html, changelog.html

The entire conversion runs locally. No APIs, no cloud services, no network requests. The output is static HTML with relative paths—no server needed to browse it.

Step 1: Export from Evernote

Evernote's desktop app (v10.134.4) provides a native export feature. Selecting notebooks and choosing Export as .enex produces XML files that contain every note with its full content and resources.

The export created two artifacts:

File           Format                 Contents
export/*.enex  XML (Evernote Export)  Notes, ENML content, base64-encoded resources, metadata
en_backup.db   SQLite 3               Evernote's internal database (notes, notebooks, tags, etc.)

The .enex Format

An .enex file is XML with this structure:

<?xml version="1.0" encoding="UTF-8"?>
<en-export application="Evernote" version="10.134.4">
  <note>
    <title>Note Title</title>
    <created>20120803T061625Z</created>
    <updated>20120803T061637Z</updated>
    <note-attributes>
      <source>web.clip</source>
      <source-url>https://example.com/article</source-url>
      <latitude>37.882</latitude>
      <longitude>-122.287</longitude>
    </note-attributes>
    <content><![CDATA[
      <?xml version="1.0" encoding="UTF-8"?>
      <en-note>...ENML content...</en-note>
    ]]></content>
    <resource>
      <data encoding="base64">/9j/4AAQ...</data>
      <mime>image/jpeg</mime>
      <width>462</width>
      <height>308</height>
      <resource-attributes>
        <file-name>photo.jpg</file-name>
      </resource-attributes>
    </resource>
  </note>
  <!-- ...more notes... -->
</en-export>

Key points:

  1. Timestamps use compact ISO 8601 (20120803T061625Z, no dashes or colons).
  2. The note body is a second, self-contained XML document (ENML) wrapped in a CDATA section.
  3. Resources are embedded inline as base64 data, each with a MIME type and an optional original filename.
  4. Web clips carry provenance in <note-attributes>: source, source-url, and GPS coordinates when available.

The en_backup.db SQLite Database

Evernote also exports its internal SQLite database. This is a separate artifact from the .enex files and has not been used in the current conversion. It likely contains a relational schema with tables for notes, notebooks, tags, and their relationships—potentially useful for a future Django migration.

Step 2: convert.py — Parse & Extract (Phase 1)

Streaming XML Parser

The script uses Python's built-in xml.etree.ElementTree.iterparse to stream-parse each .enex file. This is critical because several files are enormous:

File                      Size
PB1099.enex               1.7 GB (467 notes)
All Instagram Posts.enex  669 MB (67 notes)
All Photography.enex      505 MB (249 notes)

Loading these into a DOM tree would consume many gigabytes of RAM. Instead, iterparse fires events as elements are closed, and elem.clear() frees each note after processing. Think of it as the XML equivalent of Django's StreamingHttpResponse—constant memory regardless of file size.

context = ET.iterparse(str(enex_path), events=('end',))
for event, elem in context:
    if elem.tag != 'note':
        continue
    # process note...
    elem.clear()  # free memory

Resource Extraction & Content-Addressable Lookup

Each <resource> element's base64 data is decoded and written to a _resources/ subfolder next to the note. The MD5 hash of the decoded bytes is computed:

raw = base64.b64decode(data_elem.text)
md5_hash = hashlib.md5(raw).hexdigest()

This hash is used to build a lookup table: hash → (filepath, mime_type). When the ENML content references <en-media hash="abc123" type="image/jpeg">, the converter looks up abc123 in the table and replaces the tag with <img src="Note_resources/photo.jpg">.

This is essentially a content-addressable storage pattern—the same approach Git uses for object storage, where the hash of the content serves as the key.
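A minimal sketch of the lookup-and-replace step, assuming a resource_map dict of the shape described above (the function name and tag handling here are illustrative, not convert.py's actual identifiers):

```python
import re

def replace_en_media(content, resource_map):
    """Swap each <en-media> tag for a concrete HTML reference, using
    the hash -> (filepath, mime_type) table built during extraction."""
    def repl(match):
        entry = resource_map.get(match.group(1))
        if entry is None:
            return ''  # resource missing from this note's export
        path, mime = entry
        if mime.startswith('image/'):
            return f'<img src="{path}">'
        if mime.startswith('audio/'):
            return f'<audio controls src="{path}"></audio>'
        return f'<a href="{path}">{path}</a>'
    return re.sub(r'<en-media[^>]*hash="([0-9a-f]+)"[^>]*/?>', repl, content)
```

Because the table is keyed on the content hash, the same resource referenced from multiple places resolves to one file on disk.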

ENML to HTML Conversion

ENML is a restricted subset of XHTML with custom Evernote-specific tags. The converter uses regex replacements to transform these:

ENML Tag                                      HTML Output
<en-note>                                     <div class="en-note">
<en-media hash="..." type="image/*">          <img src="Note_resources/file.jpg">
<en-media hash="..." type="audio/*">          <audio controls src="Note_resources/file.m4a">
<en-media hash="..." type="application/pdf">  <a href="Note_resources/file.pdf">
<en-todo checked="true"/>                     <input type="checkbox" checked disabled>
<en-todo checked="false"/>                    <input type="checkbox" disabled>
<en-crypt>                                    [encrypted content]
All other HTML                                Passed through unchanged

The "passed through unchanged" part is important—web clippings saved in Evernote contain their original HTML with inline styles, and the converter preserves all of that structure.
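A simplified sketch of that regex pass (the exact patterns in convert.py may differ; note that the checked-todo substitution must run before the generic one):

```python
import re

def convert_enml(enml):
    """Apply the Evernote-specific substitutions; everything else
    (inline styles, tables, links) passes through untouched."""
    html = enml
    html = re.sub(r'<en-note[^>]*>', '<div class="en-note">', html)
    html = html.replace('</en-note>', '</div>')
    html = re.sub(r'<en-todo[^>]*checked="true"[^>]*/?>',
                  '<input type="checkbox" checked disabled>', html)
    html = re.sub(r'<en-todo[^>]*/?>',
                  '<input type="checkbox" disabled>', html)
    html = re.sub(r'<en-crypt[^>]*>.*?</en-crypt>',
                  '[encrypted content]', html, flags=re.DOTALL)
    return html
```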

Path Encoding

File paths in src and href attributes are percent-encoded with urllib.parse.quote(). This was a bug fix—the initial version used html.escape(), which converted apostrophes to &#x27; and broke paths for notes like "Egypt's corrupt decades." Browsers expect percent-encoding in URLs, not HTML entities.
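The difference is easy to demonstrate; the filename below is made up for illustration, only the "Egypt's corrupt decades" note title comes from the archive:

```python
import html
from urllib.parse import quote

path = "Egypt's corrupt decades_resources/photo 1.jpg"

# html.escape produces an HTML entity; the browser treats &#x27; as
# literal URL characters and the file is never found:
print(html.escape(path, quote=True))
# Egypt&#x27;s corrupt decades_resources/photo 1.jpg

# urllib.parse.quote percent-encodes, which browsers decode correctly.
# '/' survives because it is in quote()'s default safe set:
print(quote(path))
# Egypt%27s%20corrupt%20decades_resources/photo%201.jpg
```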

Step 3: convert.py — Whisper Transcription (Phase 2)

The archive contains 443 audio files (mostly .m4a voice dictations, some .wav). These are transcribed locally using OpenAI Whisper, an open-source speech recognition model.

How It Works
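In outline: walk the notes tree, find every audio file, transcribe it, and write a .txt sidecar. The sketch below is an approximation of that loop; whisper.load_model() and model.transcribe() are the library's real API, but the function name and the skip-if-done resumability check are assumptions about convert.py:

```python
from pathlib import Path

AUDIO_EXTS = {'.m4a', '.wav'}

def transcribe_tree(root, model_name='base'):
    """Write a .txt transcript next to each audio file under root.
    Skips files already transcribed, so interrupted runs can resume.
    Returns the number of files transcribed this run."""
    model = None  # load lazily: a fully transcribed tree costs nothing
    done = 0
    for audio in sorted(Path(root).rglob('*')):
        if audio.suffix.lower() not in AUDIO_EXTS:
            continue
        txt = audio.with_suffix('.txt')
        if txt.exists():
            continue
        if model is None:
            import whisper  # pip install openai-whisper; needs ffmpeg
            model = whisper.load_model(model_name)
        text = model.transcribe(str(audio))['text'].strip()
        txt.write_text(text, encoding='utf-8')
        done += 1
    return done
```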

Where Transcriptions Go

For each audio file, two things happen:

  1. A .txt file is saved alongside the audio (e.g., audio.m4a → audio.txt)
  2. A <details> block is injected into the parent HTML file below the <audio> tag:
<audio controls src="Note_resources/audio.m4a"></audio>
<details class="transcription">
  <summary>Transcription</summary>
  <p>The transcribed text appears here...</p>
</details>
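A sketch of how that injection might be done with plain string replacement, assuming Phase 1 emitted the <audio> tag in exactly this shape (the function name is illustrative):

```python
def inject_transcription(page_html, audio_src, text):
    """Insert a <details> transcription block immediately after the
    <audio> element whose src matches audio_src."""
    anchor = f'<audio controls src="{audio_src}"></audio>'
    block = (f'{anchor}\n'
             f'<details class="transcription">\n'
             f'  <summary>Transcription</summary>\n'
             f'  <p>{text}</p>\n'
             f'</details>')
    # Replace only the first occurrence, in case a note embeds
    # the same audio file twice.
    return page_html.replace(anchor, block, 1)
```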

Results

Metric                     Value
Audio files processed      443
Successfully transcribed   437 (6 were empty/silent)
Total words transcribed    269,295
Runtime (CPU, base model)  ~57 minutes

Upgrading

Whisper offers larger models (small, medium, large) with better accuracy, especially for accented speech or noisy recordings. To re-transcribe, change the model name in convert.py and re-run Phase 2. Trade-off: the large model is ~10x slower than base on CPU.

Step 4: convert.py — Index Generation (Phase 3)

The final phase walks the notes/ directory, discovers the notebook/stack structure, and generates index.html, along with guide.html and changelog.html.

Dependencies

Dependency             Phase  Purpose                           Install
Python 3               All    Runtime                           (system)
xml.etree.ElementTree  1      XML streaming parser              stdlib
hashlib                1      MD5 hash for resource matching    stdlib
base64                 1      Decode resource data              stdlib
urllib.parse           1, 3   Percent-encode file paths         stdlib
openai-whisper         2      Speech-to-text transcription      pip install openai-whisper
ffmpeg                 2      Audio decoding (used by Whisper)  brew install ffmpeg

Phases 1 and 3 use only the Python standard library—no pip packages required.

Next Phase: Django + Postgres

The static HTML archive preserves the data. The Django application will make it usable—searchable, self-organizing, and extensible for future work.

Design Principles

  1. Organization by weight, not filing — The current archive mirrors Evernote's notebook/stack hierarchy. The Django version replaces this with emergent categories generated from content analysis. No human-assigned filing system.
  2. Noguchi self-organization — Within each category, notes are ordered by last accessed time. Active notes float to the top; cold notes sink. Zero maintenance.
  3. Notebook names are tags, not structure — The original notebook name is preserved as a tag on each note, but it has no influence on the category hierarchy. A notebook called "The Flight of Horace" only becomes a category if that phrase carries weight in note titles and content—not because it was a notebook.
  4. Yahoo directory-style index — The homepage is a grid of supercategories, each showing its top 10 most recently accessed notes. Modeled after Yahoo's late-1990s directory layout.

Data Source

Two options for seeding the database:

  1. en_backup.db — Evernote's SQLite database, likely already has a relational schema with notes, notebooks, tags. Can be explored with sqlite3 en_backup.db ".tables" and .schema.
  2. The .enex files — Re-parse with a Django management command that creates model instances instead of writing HTML files. The parsing logic in convert.py can be adapted directly.

Models

class Tag(models.Model):
    """Tags on notes. Notebook names are auto-imported as tags."""
    name = models.CharField(max_length=255, unique=True)
    is_notebook = models.BooleanField(default=False)
    # is_notebook=True means this tag was generated from an
    # Evernote notebook name. It's metadata, not organizational.

class Category(models.Model):
    """Emergent categories generated from weighted content analysis.
    These are NOT notebook names. They are algorithmically derived."""
    name = models.CharField(max_length=255, unique=True)
    weight = models.FloatField(default=0.0, db_index=True)
    # Higher weight = more prominent on the index grid

class Note(models.Model):
    title = models.CharField(max_length=500)
    created = models.DateTimeField()
    updated = models.DateTimeField(null=True)
    last_accessed = models.DateTimeField(auto_now=True, db_index=True)
    # ^ Noguchi: updated on every view, drives sort order
    content_html = models.TextField()
    source = models.CharField(max_length=100, blank=True)
    source_url = models.URLField(max_length=2000, blank=True)
    latitude = models.FloatField(null=True, blank=True)
    longitude = models.FloatField(null=True, blank=True)
    has_audio = models.BooleanField(default=False)
    tags = models.ManyToManyField(Tag, blank=True, related_name='notes')
    categories = models.ManyToManyField(Category, blank=True,
                                         related_name='notes')

class Resource(models.Model):
    note = models.ForeignKey(Note, on_delete=models.CASCADE,
                             related_name='resources')
    md5_hash = models.CharField(max_length=32, db_index=True)
    mime_type = models.CharField(max_length=100)
    original_filename = models.CharField(max_length=255, blank=True)
    file = models.FileField(upload_to='resources/')

class Transcription(models.Model):
    resource = models.OneToOneField(Resource, on_delete=models.CASCADE)
    text = models.TextField()
    word_count = models.IntegerField()
    model_used = models.CharField(max_length=50, default='base')

class StopTerm(models.Model):
    """Terms excluded from tagging and category generation."""
    term = models.CharField(max_length=200, unique=True)
    reason = models.CharField(max_length=200, blank=True)
    # e.g. "auto-generated", "too generic", "noise"

Category Generation: TF-IDF + Noise Filtering

Categories are generated algorithmically, not assigned manually. The process:

  1. Extract terms from note titles and content across the entire corpus
  2. Filter noise using three layers of exclusion
  3. Generate categories from the highest-weighted terms/phrases
  4. Assign notes to categories based on which terms appear in their title/content

The StopTerm table doubles as a log of why something was excluded—useful when tuning the system later.
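A toy version of the weighting step; the tokenizer, thresholds, and function name are placeholders, not the planned implementation:

```python
import math
import re
from collections import Counter

def tfidf_terms(docs, stop_terms=frozenset(), min_len=4):
    """Score terms across a corpus of note texts with TF-IDF.
    Stop terms and very short tokens are filtered out up front,
    mirroring the StopTerm exclusion list."""
    tokenized = []
    for doc in docs:
        toks = [t for t in re.findall(r'[a-z]+', doc.lower())
                if len(t) >= min_len and t not in stop_terms]
        tokenized.append(toks)
    df = Counter()  # document frequency: in how many notes a term appears
    for toks in tokenized:
        df.update(set(toks))
    n = len(docs)
    scores = Counter()
    for toks in tokenized:
        if not toks:
            continue
        for term, count in Counter(toks).items():
            # term frequency in this note x inverse document frequency
            scores[term] += (count / len(toks)) * math.log(n / df[term])
    return dict(scores)
```

Terms that appear in every note get an IDF of zero and fall out naturally, which is exactly the behavior wanted for boilerplate vocabulary.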

Index View: Yahoo Directory + Noguchi

The homepage renders as a grid of category cells, sorted by category weight (heaviest categories most prominent). Each cell shows the category name, its weight, and the ten most recently accessed notes in that category.

Every time a user views a note, last_accessed updates and the note floats to the top of its category on the index. Notes that haven't been accessed in months sink to the bottom and eventually off the visible list.

+----------------------+----------------------+----------------------+
| Egypt                | Photography          | Writing              |
| (weight: 8.4)        | (weight: 7.1)        | (weight: 6.8)        |
|                      |                      |                      |
| Gardiner Sign List   | Bear Valley Trail    | Voice notes ch.1     |
| The Nile Cruise      | Lomography tips      | Character sketches   |
| Ancient temples      | Cyanotype process    | Tristan retelling    |
| ...7 more            | ...7 more            | ...7 more            |
+----------------------+----------------------+----------------------+
| Art                  | History              | Tech                 |
| (weight: 5.9)        | (weight: 5.2)        | (weight: 4.7)        |
| ...                  | ...                  | ...                  |
+----------------------+----------------------+----------------------+

What Django Enables