Methodology
How Nexus links federal contract data to public company information
Overview
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ SEC EDGAR │ │ SAM.gov │ │ USAspending │
│ (Public Cos) │ │ (Fed Vendors) │ │ (Fed Contracts) │
│ │ │ │ │ │
│ CIK + Ticker │ │ UEI │ │ UEI │
│ 89K companies │ │ 876K entities │ │ 179M awards │
└────────┬────────┘ └────────┬────────┘ └────────┬────────┘
│ │ │
└───────────┬───────────┴───────────┬───────────┘
│ │
▼ ▼
┌─────────────────────────────────────┐
│ ENTITY CROSSWALK │
│ SEC CIK ↔ SAM UEI ↔ USAspending │
└─────────────────────────────────────┘
│
▼
NEXUS UIData Sources
SEC EDGAR
Public company filings
CIK, company name, tickers, SIC code, state, EIN
Daily
SAM.gov
Federal vendor registry
UEI, CAGE code, legal name, DBA, address
Monthly
USAspending
Federal contract awards
Award ID, recipient UEI, agency, amount, dates
Daily
Entity Resolution
The Problem
No common identifier exists across all three sources. SEC uses CIK, SAM/USAspending use UEI, and company names are inconsistent.
"Apple Inc.""APPLE INC""APPLE COMPUTER INC"Step 1: Name Normalization
Standardize company names by removing legal suffixes, punctuation, and converting to uppercase.
"Apple Inc."→"APPLE""Microsoft Corporation"→"MICROSOFT"Step 2: Blocking
Group by first 3 characters to reduce comparison pairs from 78 billion to ~50 million.
Step 3: Trigram Similarity
Use PostgreSQL's pg_trgm extension for fuzzy matching. Match threshold: similarity > 0.7
Step 4: Score Boosting
Additional signals improve accuracy: same state (+0.1), same EIN (+0.2), exact match (1.0)
Expected Results
Data Refresh Strategy
| Source | Frequency | Method |
|---|---|---|
| SEC EDGAR | Weekly | Delta update from submissions.zip |
| SAM.gov | Monthly | Full refresh from monthly extract |
| USAspending | Monthly | Incremental award updates |
| Crosswalk | Monthly | Re-run after source updates |
Limitations
- •Fuzzy matching is imperfect — Some false positives/negatives will exist
- •Subsidiaries — SEC data is at parent level; contracts may be to subsidiaries
- •Name changes — Companies change names; former_names field helps but isn't complete
- •International — Focus is US federal contracts; international subsidiaries may be missed
- •Private companies — SEC only covers public companies; many large contractors are private