Back to Search

Methodology

How Nexus links federal contract data to public company information

Overview

┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│   SEC EDGAR     │     │    SAM.gov      │     │  USAspending    │
│  (Public Cos)   │     │ (Fed Vendors)   │     │ (Fed Contracts) │
│                 │     │                 │     │                 │
│  CIK + Ticker   │     │      UEI        │     │      UEI        │
│  89K companies  │     │  876K entities  │     │  179M awards    │
└────────┬────────┘     └────────┬────────┘     └────────┬────────┘
         │                       │                       │
         └───────────┬───────────┴───────────┬───────────┘
                     │                       │
                     ▼                       ▼
              ┌─────────────────────────────────────┐
              │         ENTITY CROSSWALK            │
              │  SEC CIK ↔ SAM UEI ↔ USAspending   │
              └─────────────────────────────────────┘
                              │
                              ▼
                         NEXUS UI

Data Sources

SEC EDGAR

Public company filings

89K companies
Key Fields:

CIK, company name, tickers, SIC code, state, EIN

Update Frequency:

Daily

SAM.gov

Federal vendor registry

876K entities
Key Fields:

UEI, CAGE code, legal name, DBA, address

Update Frequency:

Monthly

USAspending

Federal contract awards

179M awards
Key Fields:

Award ID, recipient UEI, agency, amount, dates

Update Frequency:

Daily

Entity Resolution

The Problem

No common identifier exists across all three sources. SEC uses CIK, SAM/USAspending use UEI, and company names are inconsistent.

"Apple Inc.""APPLE INC""APPLE COMPUTER INC"

Step 1: Name Normalization

Standardize company names by removing legal suffixes, punctuation, and converting to uppercase.

"Apple Inc.""APPLE"
"Microsoft Corporation""MICROSOFT"

Step 2: Blocking

Group by first 3 characters to reduce comparison pairs from 78 billion to ~50 million.

Block "APP": Apple, Applied Materials, Applebee's...

Step 3: Trigram Similarity

Use PostgreSQL's pg_trgm extension for fuzzy matching. Match threshold: similarity > 0.7

similarity('APPLE', 'APPLE') = 1.00
similarity('MICROSOFT', 'MICROSFT') = 0.82

Step 4: Score Boosting

Additional signals improve accuracy: same state (+0.1), same EIN (+0.2), exact match (1.0)

Expected Results

10-20K
SEC ↔ SAM matches
85-90%
Overall accuracy
500K+
SAM ↔ USAspending (UEI exact)
<5%
False positive rate

Data Refresh Strategy

SourceFrequencyMethod
SEC EDGARWeeklyDelta update from submissions.zip
SAM.govMonthlyFull refresh from monthly extract
USAspendingMonthlyIncremental award updates
CrosswalkMonthlyRe-run after source updates

Limitations

  • Fuzzy matching is imperfect — Some false positives/negatives will exist
  • Subsidiaries — SEC data is at parent level; contracts may be to subsidiaries
  • Name changes — Companies change names; former_names field helps but isn't complete
  • International — Focus is US federal contracts; international subsidiaries may be missed
  • Private companies — SEC only covers public companies; many large contractors are private