001.001 Data Sources

To perform a longitudinal analysis of age group swimming, we require a representative sample of age group swimmer performance at a minimum resolution of one year. One year should be sufficient to capture the essential dynamics of age group swimming, namely that it has two very different seasons and each season culminates in championship meets. The two seasons are short course (September to April), where competition primarily occurs in 25 yard lengths, and long course (May to August), where competition primarily occurs in 50 meter lengths.
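As a minimal sketch of the season split described above, the following maps a meet date to its season. The function names, the `"SC"`/`"LC"` labels, and the convention of labeling a season by the calendar year in which it ends are assumptions made for illustration, not something the data source defines.

```python
from datetime import date


def season_of(meet_date: date) -> str:
    """Return "LC" for long course (May to August), else "SC" for short course."""
    return "LC" if 5 <= meet_date.month <= 8 else "SC"


def season_year(meet_date: date) -> int:
    """Label a season by the calendar year in which it ends (assumed convention).

    A short course season that starts in September 2017 ends in April 2018,
    so swims from September through December are assigned to the next year.
    """
    return meet_date.year + 1 if meet_date.month >= 9 else meet_date.year


print(season_of(date(2017, 11, 4)), season_year(date(2017, 11, 4)))
print(season_of(date(2018, 7, 15)), season_year(date(2018, 7, 15)))
```

With this convention, a November 2017 swim and a March 2018 swim land in the same short course season, which keeps each championship meet with the season that leads up to it.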

Data Sources

Accordingly, we consider a data source complete if it provides a nearly complete set of results for each year that it covers. We want to be sure our data includes representative results for USA-S swimmers of all ages, genders, and abilities. Our data is limited to the following 18 Local Swimming Committees (LSCs) and 1 Zone, which provide the most complete results:


| Zone | LSC | LSC-Name                    | Meets | Years | First | Last |
|------|-----|-----------------------------|-------|-------|-------|------|
| W    | CA  | Southern California Swimming | 1075  | 14    | 2005  | 2018 |
| W    | CC  | Central California Swimming  | 255   | 12    | 2007  | 2018 |
| E    | EZ  | Eastern Zone                 | 86    | 12    | 2007  | 2018 |
| S    | FG  | Florida Gold Coast Swimming  | 964   | 14    | 2005  | 2018 |
| S    | FL  | Florida Swimming             | 2027  | 14    | 2005  | 2018 |
| C    | LE  | Lake Erie Swimming           | 665   | 14    | 2005  | 2018 |
| E    | MA  | Middle Atlantic Swimming     | 1124  | 14    | 2005  | 2018 |
| E    | MD  | Maryland Swimming            | 550   | 14    | 2005  | 2018 |
| C    | MI  | Michigan Swimming            | 271   | 5     | 2008  | 2012 |
| E    | MR  | Metropolitan Swimming        | 100   | 1     | 2014  | 2014 |
| W    | MT  | Montana Swimming             | 311   | 11    | 2008  | 2018 |
| E    | NI  | Niagara Swimming             | 1271  | 10    | 2009  | 2018 |
| E    | NJ  | New Jersey Swimming          | 342   | 5     | 2014  | 2018 |
| S    | NT  | North Texas Swimming         | 638   | 14    | 2005  | 2018 |
| W    | OR  | Oregon Swimming              | 1486  | 14    | 2005  | 2018 |
| W    | PC  | Pacific Swimming             | 104   | 6     | 2007  | 2012 |
| E    | PV  | Potomac Valley Swimming      | 822   | 10    | 2009  | 2018 |
| W    | SI  | San Diego-Imperial Swimming  | 317   | 11    | 2008  | 2018 |
| E    | VA  | Virginia Swimming            | 3020  | 14    | 2005  | 2018 |


Together, these meet results files record over 25 million swims from over 15 thousand meets:


| Measure | Count      |
|---------|------------|
| Meets   | 15,009     |
| Days    | 33,999     |
| Events  | 971,388    |
| Heats   | 3,777,348  |
| Swims   | 25,807,449 |

Data Cleanup

The USA-S meet results files are a rich source of data, but like any large data set, they are imperfect. The files contain errors, duplicates, and omissions. They also contain biases that may taint our subsequent analysis, such as the inclusion of foreign athletes or YMCA athletes. The goal of our analysis is to better understand age group swimming, as sanctioned by USA-S. To that end, we must limit our analysis to swims by USA-S athletes, in the events that such swimmers participate in.


To improve the quality of our subsequent analysis, we:

  1. Eliminate duplicate records.

  2. Include relay leadoff legs when credible relay splits are available. Relay splits are considered credible when the number of splits is a multiple of four, the cumulative splits are monotonically increasing, and the final split agrees with the final relay time.

  3. Exclude swims whose results indicate a near certain timing system malfunction. A malfunction is indicated when plunger times are available but none of them is within 20% of the touchpad time.

  4. Exclude swims whose times are faster than the US National Age Group Records.

  5. Exclude swims for which USA-S does not maintain national age group records.

  6. Exclude swims made by athletes without a valid USA-S identifier.

  7. Exclude swims by athletes younger than 6 or older than 18.
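Two of the rules above (credible relay splits and timing system malfunctions) are mechanical enough to sketch in code. This is an illustration under stated assumptions, not the code used for this analysis: the function names, the representation of splits as cumulative seconds, the strict reading of "monotonically increasing", and the 0.01 second agreement tolerance are all hypothetical choices. Only the multiple-of-four rule and the 20% threshold come from the list above.

```python
def credible_relay_splits(splits, final_time, tolerance=0.01):
    """Check cumulative relay splits (seconds) against the three rules.

    tolerance is an assumed allowance for rounding when comparing the
    last split with the recorded final relay time.
    """
    # Rule 1: the number of splits must be a (non-zero) multiple of four.
    if not splits or len(splits) % 4 != 0:
        return False
    # Rule 2: cumulative splits must increase (read here as strictly).
    if any(later <= earlier for earlier, later in zip(splits, splits[1:])):
        return False
    # Rule 3: the final split must agree with the final relay time.
    return abs(splits[-1] - final_time) <= tolerance


def leadoff_time(splits):
    """Cumulative time at the end of the first of four equal-length legs."""
    return splits[len(splits) // 4 - 1]


def timing_malfunction(touchpad, plungers, rel_tol=0.20):
    """True when plunger times exist but none is within 20% of the touchpad."""
    if not plungers:
        return False  # no plunger times available, so nothing to compare
    return all(abs(p - touchpad) > rel_tol * touchpad for p in plungers)


relay = [25.1, 52.3, 77.9, 105.6]
if credible_relay_splits(relay, final_time=105.6):
    print("leadoff:", leadoff_time(relay))
print("malfunction:", timing_malfunction(30.0, [12.0, 40.0]))
```

Note that a swim with no plunger times is never flagged by this rule, which matches the wording "when plunger times are available".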


Here are rough statistics on this filtering.

| Reason                      | Times     |
|-----------------------------|-----------|
| Swimmer age out of range    | 662,261   |
| Missing or invalid USA-S ID | 296,235   |
| Timing system malfunction   | 4,839     |
| Impossibly fast time        | 2,076     |
| Non-standard course (SCM)   | 239,471   |
| Non-recognized event        | 2,218,933 |
| Total                       | 2,261,311 |


The single largest effect is excluding non-recognized events, from which we lose over 2 million swims. The recognized events for USA-S age group swimming are listed in Article 102.1.2. The most common non-recognized events for USA-S are all 25 yard swims, which are common for 8/Unders, and shorter distances for older swimmers, such as the 50 backstroke, 50 breaststroke, 50 butterfly, and 100 IM for 13/Overs.


After filtering, we’re left with 22,904,416 final times and 728,575 disqualifications across 14 years from 473,878 distinct athletes and 1,197,765 athlete-years.

Data Source Biases

Career Censoring. Our data sources provide varying amounts of historical data. The career of an age group swimmer may span 12 years or more. A source that provides fewer years of data may capture only a portion of an athlete’s career swims, while a source that provides more years of data may capture all of an athlete’s career swims. To overcome this mismatch in data sources, we’ll consider athletes in combination with age, season, or calendar year.  By doing this, we count each annual instance of an athlete separately. An athlete with a 12 year career would count at most 3 times from a 3 year data source and at most 12 times from a 20 year data source.
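The athlete-year counting described above can be sketched as follows. The record layout (an athlete identifier paired with a calendar year) and the sample identifiers are hypothetical; the point is only that each athlete contributes one unit per year in which they have at least one swim.

```python
def athlete_years(swims):
    """Collapse swims into distinct (athlete_id, year) pairs.

    swims is an iterable of (athlete_id, year) tuples, one per swim.
    """
    return {(athlete_id, year) for athlete_id, year in swims}


# Hypothetical sample: athlete A1 swims in three years, B2 in one.
swims = [
    ("A1", 2016), ("A1", 2016),  # two swims in the same year count once
    ("A1", 2017),
    ("A1", 2018),
    ("B2", 2018),
]

pairs = athlete_years(swims)
athletes = {athlete_id for athlete_id, _ in pairs}
print(len(athletes), "athletes,", len(pairs), "athlete-years")
```

Counting this way means a source with fewer years contributes fewer athlete-years per athlete, rather than making that athlete's career look shorter than it was.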


Inter-LSC Participation. Age group athletes occasionally compete across LSC boundaries.  When competition crosses between one of the 18 LSCs whose data we used (listed above) and one of the remaining 35 LSCs whose data we didn’t use, it creates a bias in our data set. The main effect is to undercount swims per athlete because our data set does not include all the swims made by the athletes in our data set.  The effect is exaggerated for athletes at the top of their age group (ages 10, 12, 14, 18), who are more likely to attend inter-LSC championship meets, such as Zones. The simplest way to address this bias is to exclude all swims by athletes outside the 18 data source LSCs. This causes its own problems, because swims are not reliably labeled with the athlete’s LSC.


Recognized Events. To match the USA-S SWIMS database, our data excludes 2.2 million swims from non-recognized events, such as 25 yard swims for all ages and 50 Back/Breast/Fly for 13/Overs.  Excluding non-recognized events lowers participation measures for 8/Unders, who are most likely to swim 25s, and for casual 13/Overs who are more likely to swim 50s.


In the next post, we’ll explore what’s in this data, and what it says about USA-S’s age group program.