All four deal with messy data. Each project taught me something the next one needed. The arc isn't a plan — it's what happened.
01 ChainTax (Feb–Mar 2026)
Started with a crypto tax engine. The work surfaced the real problem: reconciling data across exchanges, wallets, and chains is enormously painful, and the difficulty has nothing to do with tax law. Messy data is the engineering problem.
02 Aether (Apr 2026)
Took the messy-data lesson into LLM territory. Built a workflow engine and retrieval pipeline that handles structurally inconsistent financial documents — hybrid retrieval, agent loops, audit trails, and cost-tiered model routing engineered against a real eval suite.
03 Polymarket Autopsy (Apr–Jun 2026)
Applied the LLM-workflow toolkit at scale: a 3-layer classification pipeline over thousands of trader wallets, feeding 180 paper bots through millions of simulated trades. The bots then traded real capital, and the autopsy documents why paper performance was anti-predictive of live results.
04 vLLM Retrieval Forensics (In Progress, 2026)
Builds on the earlier projects: the LLM workflow from Aether and the forensic methodology from Polymarket, applied to LLM-based code retrieval over the vLLM codebase. Failures are documented stage by stage as the project proceeds.
Projects
In conviction order
Polymarket Trading Bot Autopsy
Apr–Jun 2026
A 45-day systematic trading bot project documenting how measurement bugs made paper backtests anti-predictive of live performance.
What it proves
Forensic methodology applied to a real production system — from data pipeline, to LLM-driven analysis, to live execution, to public technical writeup. Built the system, ran it, broke it, documented why.
Key findings
Paper PnL inflated ~135× by three measurement bug classes
The bots my paper trading flagged as best performed worst live
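One way to make "anti-predictive" measurable is a rank correlation between paper and live results per bot. The sketch below is a minimal illustration, not the autopsy's actual analysis code: the function names are hypothetical and the PnL figures are toy numbers invented for the example. A Spearman coefficient near -1 would mean the paper ranking was close to an exact inversion of the live ranking.

```python
from typing import Sequence


def ranks(values: Sequence[float]) -> list[float]:
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg_rank
        i = j + 1
    return out


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Toy, made-up numbers: one entry per bot, same order in both lists.
paper_pnl = [120.0, 80.0, 45.0, 10.0]
live_pnl = [-30.0, -12.0, 4.0, 9.0]
rho = spearman(paper_pnl, live_pnl)  # negative: paper ranking inverts live
```

A coefficient near zero would only say paper results were uninformative; a clearly negative one is the stronger claim that ranking by paper results actively picks the wrong bots.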
Aether
Apr 2026
Hybrid RAG pipeline over financial compliance documents with a planner/executor/critic agent loop and full audit trails.
What it proves
Retrieval architecture decisions made deliberately — sparse + dense fusion, cross-encoder reranking, cost-tiered model routing — measured against a real eval suite with perfect reproducibility on the retrieval side.
Key findings
Retrieval precision: 96% on the eval suite, perfectly reproducible across 5 runs
End-to-end correctness 87% at $0.65 per run
Tiered model routing: Opus plans, Haiku critiques, executor LLM-free
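For readers unfamiliar with sparse + dense fusion, the sketch below shows one common way to merge two ranked lists, reciprocal rank fusion (RRF). This is an assumption chosen for illustration: the text above does not say which fusion method the pipeline uses, and the function name, document IDs, and `k=60` default are all hypothetical.

```python
from collections import defaultdict
from typing import Iterable, Sequence


def reciprocal_rank_fusion(
    rankings: Iterable[Sequence[str]], k: int = 60
) -> list[str]:
    """Fuse ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Ties are broken by document ID so the output
    is deterministic.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from a sparse (keyword) and a dense (embedding) retriever.
sparse_hits = ["a", "b", "c"]
dense_hits = ["b", "c", "d"]
fused = reciprocal_rank_fusion([sparse_hits, dense_hits])
# Documents found by both retrievers ("b", "c") rise above single-source hits.
```

Because the fusion is a pure function of the two input rankings and ties are broken deterministically, it adds no run-to-run variance of its own, which fits a retrieval stage that is expected to reproduce exactly.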
ChainTax
Feb–Mar 2026
Cross-source data ingestion and reconciliation pipeline turning hundreds of thousands of crypto events into auditable IRS filings.
What it proves
Deterministic data engineering on adversarially messy inputs — multiple APIs, multiple chains, multiple semantic conventions — collapsed into a single reproducible pipeline with auditable output.
Key findings
540K+ events processed in a single pipeline run
Cross-chain bridge detection prevents double-counting the same dollar across chains
FIFO lot matching with SpecID retrospective comparison
Three competing IRS funding treatments produce three Form 8949 variants per run
Section 1092 offsetting-position detection across spot, perp, and correlated pairs
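FIFO lot matching, mentioned in the findings above, can be sketched as follows. This is a minimal illustration under simplifying assumptions (a single asset, no fees, no holding-period tracking), not the pipeline's actual code; `Lot` and `fifo_gain` are hypothetical names.

```python
from collections import deque
from dataclasses import dataclass
from decimal import Decimal


@dataclass
class Lot:
    quantity: Decimal   # units still unsold in this lot
    unit_cost: Decimal  # cost basis per unit


def fifo_gain(lots: deque, sell_qty: Decimal, sell_price: Decimal) -> Decimal:
    """Consume the oldest lots first and return the realized gain.

    `lots` is ordered oldest to newest and is mutated in place.
    """
    remaining = sell_qty
    gain = Decimal("0")
    while remaining > 0:
        if not lots:
            raise ValueError("sale exceeds available lots")
        lot = lots[0]
        used = min(lot.quantity, remaining)
        gain += used * (sell_price - lot.unit_cost)
        lot.quantity -= used
        remaining -= used
        if lot.quantity == 0:
            lots.popleft()  # lot fully consumed
    return gain


# Toy example: buy 1 unit at 100, then 2 units at 150, then sell 2 at 200.
open_lots = deque([
    Lot(Decimal("1"), Decimal("100")),
    Lot(Decimal("2"), Decimal("150")),
])
realized = fifo_gain(open_lots, Decimal("2"), Decimal("200"))
```

`Decimal` is used instead of `float` so that repeated partial fills do not accumulate rounding error in the cost basis. A SpecID comparison would swap the "oldest first" choice for an explicit lot selection while keeping the same gain arithmetic.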
vLLM Retrieval Forensics
In Progress, 2026
Forensic study of retrieval and answer generation over the vLLM codebase, extending the Polymarket autopsy methodology to LLM infrastructure systems.
What it proves
The forensic measurement methodology generalizes from trading systems to LLM systems themselves. Project planning published; implementation underway with first case study expected within 4–6 weeks.