Tech Stack

How the archive was extracted from Evernote and converted into a browsable HTML site.


Pipeline Overview

Evernote Desktop App
        |
        | native export (File → Export Notebooks)
        v
112 .enex files (4.7 GB XML + base64 resources)
en_backup.db (SQLite)
        |
        | convert.py (Phase 1: iterparse + regex)
        v
4,344 HTML files + 16,543 extracted resources
        |
        | convert.py (Phase 2: openai-whisper, local)
        v
437 audio transcriptions (.txt + injected into HTML)
        |
        | convert.py (Phase 3)
        v
index.html, guide.html, changelog.html

The entire conversion runs locally. No APIs, no cloud services, no network requests. The output is static HTML with relative paths—no server needed to browse it.

Step 1: Export from Evernote

Evernote's desktop app (v10.134.4) provides a native export feature. Selecting notebooks and choosing Export as .enex produces XML files that contain every note with its full content and resources.

The export created two artifacts:

File           Format                 Contents
export/*.enex  XML (Evernote Export)  Notes, ENML content, base64-encoded resources, metadata
en_backup.db   SQLite 3               Evernote's internal database (notes, notebooks, tags, etc.)

The .enex Format

An .enex file is XML with this structure:

<?xml version="1.0" encoding="UTF-8"?>
<en-export application="Evernote" version="10.134.4">
  <note>
    <title>Note Title</title>
    <created>20120803T061625Z</created>
    <updated>20120803T061637Z</updated>
    <note-attributes>
      <source>web.clip</source>
      <source-url>https://example.com/article</source-url>
      <latitude>37.882</latitude>
      <longitude>-122.287</longitude>
    </note-attributes>
    <content><![CDATA[
      <?xml version="1.0" encoding="UTF-8"?>
      <en-note>...ENML content...</en-note>
    ]]></content>
    <resource>
      <data encoding="base64">/9j/4AAQ...</data>
      <mime>image/jpeg</mime>
      <width>462</width>
      <height>308</height>
      <resource-attributes>
        <file-name>photo.jpg</file-name>
      </resource-attributes>
    </resource>
  </note>
  <!-- ...more notes... -->
</en-export>

Key points:

  1. Timestamps use compact ISO 8601 (20120803T061625Z, no dashes or colons).
  2. The note body is a second, self-contained XML document (ENML) wrapped in a CDATA section.
  3. Resources are embedded inline as base64 data, each with a MIME type and an optional original filename.
  4. Web clips carry provenance in <note-attributes>: source, source-url, and GPS coordinates when available.

The en_backup.db SQLite Database

Evernote also exports its internal SQLite database. This is a separate artifact from the .enex files and has not been used in the current conversion. It likely contains a relational schema with tables for notes, notebooks, tags, and their relationships—potentially useful for a future Django migration.

Step 2: convert.py — Parse & Extract (Phase 1)

Streaming XML Parser

The script uses Python's built-in xml.etree.ElementTree.iterparse to stream-parse each .enex file. This is critical because several files are enormous:

File                      Size
PB1099.enex               1.7 GB (467 notes)
All Instagram Posts.enex  669 MB (67 notes)
All Photography.enex      505 MB (249 notes)

Loading these into a DOM tree would consume many gigabytes of RAM. Instead, iterparse fires events as elements are closed, and elem.clear() frees each note after processing. Think of it as the XML equivalent of Django's StreamingHttpResponse—constant memory regardless of file size.

context = ET.iterparse(str(enex_path), events=('end',))
for event, elem in context:
    if elem.tag != 'note':
        continue
    # process note...
    elem.clear()  # free memory

Resource Extraction & Content-Addressable Lookup

Each <resource> element's base64 data is decoded and written to a _resources/ subfolder next to the note. The MD5 hash of the decoded bytes is computed:

raw = base64.b64decode(data_elem.text)
md5_hash = hashlib.md5(raw).hexdigest()

This hash is used to build a lookup table: hash → (filepath, mime_type). When the ENML content references <en-media hash="abc123" type="image/jpeg">, the converter looks up abc123 in the table and replaces the tag with <img src="Note_resources/photo.jpg">.

This is essentially a content-addressable storage pattern—the same approach Git uses for object storage, where the hash of the content serves as the key.
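A minimal sketch of the lookup-and-replace step, assuming a resource_map dict of the shape described above (the function name and tag handling here are illustrative, not convert.py's actual identifiers):

```python
import re

def replace_en_media(content, resource_map):
    """Swap each <en-media> tag for a concrete HTML reference, using
    the hash -> (filepath, mime_type) table built during extraction."""
    def repl(match):
        entry = resource_map.get(match.group(1))
        if entry is None:
            return ''  # resource missing from this note's export
        path, mime = entry
        if mime.startswith('image/'):
            return f'<img src="{path}">'
        if mime.startswith('audio/'):
            return f'<audio controls src="{path}"></audio>'
        return f'<a href="{path}">{path}</a>'
    return re.sub(r'<en-media[^>]*hash="([0-9a-f]+)"[^>]*/?>', repl, content)
```

Because the table is keyed on the content hash, the same resource referenced from multiple places resolves to one file on disk.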

ENML to HTML Conversion

ENML is a restricted subset of XHTML with custom Evernote-specific tags. The converter uses regex replacements to transform these:

ENML Tag                                      HTML Output
<en-note>                                     <div class="en-note">
<en-media hash="..." type="image/*">          <img src="Note_resources/file.jpg">
<en-media hash="..." type="audio/*">          <audio controls src="Note_resources/file.m4a">
<en-media hash="..." type="application/pdf">  <a href="Note_resources/file.pdf">
<en-todo checked="true"/>                     <input type="checkbox" checked disabled>
<en-todo checked="false"/>                    <input type="checkbox" disabled>
<en-crypt>                                    [encrypted content]
All other HTML                                Passed through unchanged

The "passed through unchanged" part is important—web clippings saved in Evernote contain their original HTML with inline styles, and the converter preserves all of that structure.
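A simplified sketch of that regex pass (the exact patterns in convert.py may differ; note that the checked-todo substitution must run before the generic one):

```python
import re

def convert_enml(enml):
    """Apply the Evernote-specific substitutions; everything else
    (inline styles, tables, links) passes through untouched."""
    html = enml
    html = re.sub(r'<en-note[^>]*>', '<div class="en-note">', html)
    html = html.replace('</en-note>', '</div>')
    html = re.sub(r'<en-todo[^>]*checked="true"[^>]*/?>',
                  '<input type="checkbox" checked disabled>', html)
    html = re.sub(r'<en-todo[^>]*/?>',
                  '<input type="checkbox" disabled>', html)
    html = re.sub(r'<en-crypt[^>]*>.*?</en-crypt>',
                  '[encrypted content]', html, flags=re.DOTALL)
    return html
```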

Path Encoding

File paths in src and href attributes are percent-encoded with urllib.parse.quote(). This was a bug fix—the initial version used html.escape(), which converted apostrophes to &#x27; and broke paths for notes like "Egypt's corrupt decades." Browsers expect percent-encoding in URLs, not HTML entities.
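The difference is easy to demonstrate; the filename below is made up for illustration, only the "Egypt's corrupt decades" note title comes from the archive:

```python
import html
from urllib.parse import quote

path = "Egypt's corrupt decades_resources/photo 1.jpg"

# html.escape produces an HTML entity; the browser treats &#x27; as
# literal URL characters and the file is never found:
print(html.escape(path, quote=True))
# Egypt&#x27;s corrupt decades_resources/photo 1.jpg

# urllib.parse.quote percent-encodes, which browsers decode correctly.
# '/' survives because it is in quote()'s default safe set:
print(quote(path))
# Egypt%27s%20corrupt%20decades_resources/photo%201.jpg
```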

Step 3: convert.py — Whisper Transcription (Phase 2)

The archive contains 443 audio files (mostly .m4a voice dictations, some .wav). These are transcribed locally using OpenAI Whisper, an open-source speech recognition model.

How It Works
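In outline: walk the notes tree, find every audio file, transcribe it, and write a .txt sidecar. The sketch below is an approximation of that loop; whisper.load_model() and model.transcribe() are the library's real API, but the function name and the skip-if-done resumability check are assumptions about convert.py:

```python
from pathlib import Path

AUDIO_EXTS = {'.m4a', '.wav'}

def transcribe_tree(root, model_name='base'):
    """Write a .txt transcript next to each audio file under root.
    Skips files already transcribed, so interrupted runs can resume.
    Returns the number of files transcribed this run."""
    model = None  # load lazily: a fully transcribed tree costs nothing
    done = 0
    for audio in sorted(Path(root).rglob('*')):
        if audio.suffix.lower() not in AUDIO_EXTS:
            continue
        txt = audio.with_suffix('.txt')
        if txt.exists():
            continue
        if model is None:
            import whisper  # pip install openai-whisper; needs ffmpeg
            model = whisper.load_model(model_name)
        text = model.transcribe(str(audio))['text'].strip()
        txt.write_text(text, encoding='utf-8')
        done += 1
    return done
```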

Where Transcriptions Go

For each audio file, two things happen:

  1. A .txt file is saved alongside the audio (e.g., audio.m4a → audio.txt)
  2. A <details> block is injected into the parent HTML file below the <audio> tag:
<audio controls src="Note_resources/audio.m4a"></audio>
<details class="transcription">
  <summary>Transcription</summary>
  <p>The transcribed text appears here...</p>
</details>
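A sketch of how that injection might be done with plain string replacement, assuming Phase 1 emitted the <audio> tag in exactly this shape (the function name is illustrative):

```python
def inject_transcription(page_html, audio_src, text):
    """Insert a <details> transcription block immediately after the
    <audio> element whose src matches audio_src."""
    anchor = f'<audio controls src="{audio_src}"></audio>'
    block = (f'{anchor}\n'
             f'<details class="transcription">\n'
             f'  <summary>Transcription</summary>\n'
             f'  <p>{text}</p>\n'
             f'</details>')
    # Replace only the first occurrence, in case a note embeds
    # the same audio file twice.
    return page_html.replace(anchor, block, 1)
```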

Results

Metric                     Value
Audio files processed      443
Successfully transcribed   437 (6 were empty/silent)
Total words transcribed    269,295
Runtime (CPU, base model)  ~57 minutes

Upgrading

Whisper offers larger models (small, medium, large) with better accuracy, especially for accented speech or noisy recordings. To re-transcribe, change the model name in convert.py and re-run Phase 2. Trade-off: the large model is ~10x slower than base on CPU.

Step 4: convert.py — Index Generation (Phase 3)

The final phase walks the notes/ directory, discovers the notebook/stack structure, and generates index.html, along with guide.html and changelog.html.

Dependencies

Dependency             Phase  Purpose                           Install
Python 3               All    Runtime                           (system)
xml.etree.ElementTree  1      XML streaming parser              stdlib
hashlib                1      MD5 hash for resource matching    stdlib
base64                 1      Decode resource data              stdlib
urllib.parse           1, 3   Percent-encode file paths         stdlib
openai-whisper         2      Speech-to-text transcription      pip install openai-whisper
ffmpeg                 2      Audio decoding (used by Whisper)  brew install ffmpeg

Phases 1 and 3 use only the Python standard library—no pip packages required.

Next Phase: Django + Postgres

The static HTML archive preserves the data. The Django application will make it usable—searchable, self-organizing, and extensible for future work.

Design Principles

  1. Organization by weight, not filing — The current archive mirrors Evernote's notebook/stack hierarchy. The Django version replaces this with emergent categories generated from content analysis. No human-assigned filing system.
  2. Noguchi self-organization — Within each category, notes are ordered by last accessed time. Active notes float to the top; cold notes sink. Zero maintenance.
  3. Notebook names are tags, not structure — The original notebook name is preserved as a tag on each note, but it has no influence on the category hierarchy. A notebook called "The Flight of Horace" only becomes a category if that phrase carries weight in note titles and content—not because it was a notebook.
  4. Yahoo directory-style index — The homepage is a grid of supercategories, each showing its top 10 most recently accessed notes. Modeled after Yahoo's late-1990s directory layout.

Data Source

Two options for seeding the database:

  1. en_backup.db — Evernote's SQLite database, likely already has a relational schema with notes, notebooks, tags. Can be explored with sqlite3 en_backup.db ".tables" and .schema.
  2. The .enex files — Re-parse with a Django management command that creates model instances instead of writing HTML files. The parsing logic in convert.py can be adapted directly.

Models

class Tag(models.Model):
    """Tags on notes. Notebook names are auto-imported as tags."""
    name = models.CharField(max_length=255, unique=True)
    is_notebook = models.BooleanField(default=False)
    # is_notebook=True means this tag was generated from an
    # Evernote notebook name. It's metadata, not organizational.

class Category(models.Model):
    """Emergent categories generated from weighted content analysis.
    These are NOT notebook names. They are algorithmically derived."""
    name = models.CharField(max_length=255, unique=True)
    weight = models.FloatField(default=0.0, db_index=True)
    # Higher weight = more prominent on the index grid

class Note(models.Model):
    title = models.CharField(max_length=500)
    created = models.DateTimeField()
    updated = models.DateTimeField(null=True)
    last_accessed = models.DateTimeField(auto_now=True, db_index=True)
    # ^ Noguchi: updated on every view, drives sort order
    content_html = models.TextField()
    source = models.CharField(max_length=100, blank=True)
    source_url = models.URLField(max_length=2000, blank=True)
    latitude = models.FloatField(null=True, blank=True)
    longitude = models.FloatField(null=True, blank=True)
    has_audio = models.BooleanField(default=False)
    tags = models.ManyToManyField(Tag, blank=True, related_name='notes')
    categories = models.ManyToManyField(Category, blank=True,
                                         related_name='notes')

class Resource(models.Model):
    note = models.ForeignKey(Note, on_delete=models.CASCADE,
                             related_name='resources')
    md5_hash = models.CharField(max_length=32, db_index=True)
    mime_type = models.CharField(max_length=100)
    original_filename = models.CharField(max_length=255, blank=True)
    file = models.FileField(upload_to='resources/')

class Transcription(models.Model):
    resource = models.OneToOneField(Resource, on_delete=models.CASCADE)
    text = models.TextField()
    word_count = models.IntegerField()
    model_used = models.CharField(max_length=50, default='base')

class StopTerm(models.Model):
    """Terms excluded from tagging and category generation."""
    term = models.CharField(max_length=200, unique=True)
    reason = models.CharField(max_length=200, blank=True)
    # e.g. "auto-generated", "too generic", "noise"

Category Generation: TF-IDF + Noise Filtering

Categories are generated algorithmically, not assigned manually. The process:

  1. Extract terms from note titles and content across the entire corpus
  2. Filter noise using three layers of exclusion
  3. Generate categories from the highest-weighted terms/phrases
  4. Assign notes to categories based on which terms appear in their title/content

The StopTerm table doubles as a log of why something was excluded—useful when tuning the system later.
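A toy version of the weighting step; the tokenizer, thresholds, and function name are placeholders, not the planned implementation:

```python
import math
import re
from collections import Counter

def tfidf_terms(docs, stop_terms=frozenset(), min_len=4):
    """Score terms across a corpus of note texts with TF-IDF.
    Stop terms and very short tokens are filtered out up front,
    mirroring the StopTerm exclusion list."""
    tokenized = []
    for doc in docs:
        toks = [t for t in re.findall(r'[a-z]+', doc.lower())
                if len(t) >= min_len and t not in stop_terms]
        tokenized.append(toks)
    df = Counter()  # document frequency: in how many notes a term appears
    for toks in tokenized:
        df.update(set(toks))
    n = len(docs)
    scores = Counter()
    for toks in tokenized:
        if not toks:
            continue
        for term, count in Counter(toks).items():
            # term frequency in this note x inverse document frequency
            scores[term] += (count / len(toks)) * math.log(n / df[term])
    return dict(scores)
```

Terms that appear in every note get an IDF of zero and fall out naturally, which is exactly the behavior wanted for boilerplate vocabulary.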

Index View: Yahoo Directory + Noguchi

The homepage renders as a grid of category cells, sorted by category weight (heaviest categories most prominent). Each cell shows the category name, its weight, and the ten most recently accessed notes in that category.

Every time a user views a note, last_accessed updates and the note floats to the top of its category on the index. Notes that haven't been accessed in months sink to the bottom and eventually off the visible list.

+----------------------+----------------------+----------------------+
| Egypt                | Photography          | Writing              |
| (weight: 8.4)        | (weight: 7.1)        | (weight: 6.8)        |
|                      |                      |                      |
| Gardiner Sign List   | Bear Valley Trail    | Voice notes ch.1     |
| The Nile Cruise      | Lomography tips      | Character sketches   |
| Ancient temples      | Cyanotype process    | Tristan retelling    |
| ...7 more            | ...7 more            | ...7 more            |
+----------------------+----------------------+----------------------+
| Art                  | History              | Tech                 |
| (weight: 5.9)        | (weight: 5.2)        | (weight: 4.7)        |
| ...                  | ...                  | ...                  |
+----------------------+----------------------+----------------------+

What Django Enables