To perform a longitudinal analysis of age group swimming, we require a representative sample of age group swimmer performance at a minimum resolution of one year. One year should be sufficient to capture the essential dynamics of age group swimming, namely that it has two very different seasons and each season culminates in championship meets. The two seasons are short course (September to April), where competition primarily occurs in 25 yard lengths, and long course (May to August), where competition primarily occurs in 50 meter lengths.
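As a rough illustration of the season split described above, a meet date can be mapped to its season by month alone. This is only a sketch: the function names are ours, and labeling a short course season by the year in which it ends is an assumed convention, not something the meet results files dictate.

```python
from datetime import date


def season_of(meet_date: date) -> str:
    """Classify a date as short course (Sept-April) or long course (May-Aug)."""
    return "long course" if 5 <= meet_date.month <= 8 else "short course"


def season_end_year(meet_date: date) -> int:
    """Year in which the date's season ends.

    Assumed convention: September-December swims belong to the short
    course season that finishes the following April.
    """
    return meet_date.year + 1 if meet_date.month >= 9 else meet_date.year
```

Under this convention, a swim in October 2015 and a swim in March 2016 fall in the same short course season.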
Accordingly, we consider a data source complete if it provides a nearly complete set of results for every year it covers. We want to be sure our data includes representative results for USA-S swimmers of all ages, genders, and abilities. Our data is limited to the following 18 Local Swim Committees (LSCs) and 1 Zone, which provide the most complete results:
Together, these meet results files record over 25 million swims from over 15 thousand meets:
The USA-S meet results files are a rich source of data, but like any large data set, they are imperfect. The files contain errors, duplicates, and omissions. They also contain biases that may taint our subsequent analysis, such as the inclusion of foreign athletes or YMCA athletes. The goal of our analysis is to better understand age group swimming, as sanctioned by USA-S. To that end, we must limit our analysis to swims by USA-S athletes, in the events in which such swimmers participate.
To improve the quality of our subsequent analysis, we:
Eliminate duplicate records.
Include relay leadoff legs when credible relay splits are available. Relay splits are considered credible when the number of splits is a multiple of four, the splits are monotonically increasing, and the final split agrees with the final relay time.
Exclude swims whose results indicate a near certain timing system malfunction. A malfunction is indicated when plunger times are available but none of them is within 20% of the touchpad time.
Exclude swims whose times are faster than the US National Age Group Records.
Exclude swims for which USA-S does not maintain national age group records.
Exclude swims made by athletes without a valid USA-S identifier.
Exclude swims by athletes younger than 6 or older than 18.
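Three of the rules above are mechanical enough to sketch in code. The functions below are a minimal illustration, not the code we actually ran: the function names, the assumption that splits are cumulative times in seconds, and the 0.005 second agreement tolerance are all ours.

```python
def splits_are_credible(splits, final_time, tolerance=0.005):
    """Check cumulative relay splits (in seconds) against the three rules.

    Credible means: the number of splits is a multiple of four, the splits
    are strictly increasing, and the last split agrees with the final time.
    The tolerance is an assumed value for comparing hundredths of a second.
    """
    return (
        len(splits) > 0
        and len(splits) % 4 == 0
        and all(a < b for a, b in zip(splits, splits[1:]))
        and abs(splits[-1] - final_time) <= tolerance
    )


def leadoff_time(splits, final_time):
    """Return the leadoff leg's time, or None if the splits are not credible."""
    if not splits_are_credible(splits, final_time):
        return None
    # With 4 * k cumulative splits, the leadoff leg ends at split number k.
    return splits[len(splits) // 4 - 1]


def timing_malfunction(touchpad_time, plunger_times, fraction=0.20):
    """True when plunger times exist but none is within 20% of the touchpad."""
    if not plunger_times:
        return False  # no backup times, so nothing to contradict the pad
    limit = fraction * touchpad_time
    return all(abs(p - touchpad_time) > limit for p in plunger_times)


def age_in_range(age):
    """Keep only swims by athletes aged 6 through 18."""
    return 6 <= age <= 18
```

For example, a 200 relay with cumulative splits of 28.1, 58.3, 87.0, and 115.2 and a final time of 115.2 yields a leadoff leg of 28.1, while a 30.0 touchpad time backed by plunger times of 45.0 and 50.0 would be flagged as a malfunction.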
Here are rough statistics on this filtering.
The single largest effect is excluding non-recognized events, from which we lose over 2 million swims. The recognized events for USA-S age group swimming are listed in Article 102.1.2. The most common non-recognized events for USA-S are all 25 yard swims, which are common for 8/Unders, and shorter distances for older swimmers, such as the 50 backstroke, 50 breaststroke, 50 butterfly, and 100 IM for 13/Overs.
After filtering, we’re left with 22,904,416 final times and 728,575 disqualifications across 14 years from 473,878 distinct athletes and 1,197,765 athlete-years.
Career Censoring. Our data sources provide varying amounts of historical data. The career of an age group swimmer may span 12 years or more. A source that provides fewer years of data may capture only a portion of an athlete’s career swims, while a source that provides more years of data may capture all of them. To overcome this mismatch, we’ll consider athletes in combination with age, season, or calendar year, so that each annual instance of an athlete is counted separately. An athlete with a 12-year career would count at most 3 times from a 3-year data source and at most 12 times from a 20-year data source.
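The athlete-year count can be illustrated with a small sketch. The record layout here (pairs of athlete identifier and calendar year) and the function name are hypothetical, chosen only to show the counting rule.

```python
def athlete_years(swims, first_year, last_year):
    """Distinct (athlete, year) pairs visible in a source covering the given years.

    swims is an iterable of (athlete_id, year) pairs; multiple swims by the
    same athlete in the same year collapse into a single athlete-year.
    """
    return {(a, y) for a, y in swims if first_year <= y <= last_year}


# One athlete with a 12-year career (2005-2016), two swims per year.
career_swims = [("A", year) for year in range(2005, 2017) for _ in range(2)]

three_year_source = athlete_years(career_swims, 2014, 2016)
twenty_year_source = athlete_years(career_swims, 1997, 2016)
```

Here the 3-year source sees 3 athlete-years and the 20-year source sees all 12, even though both describe the same single athlete.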
Inter-LSC Participation. Age group athletes occasionally compete across LSC boundaries. When competition crosses between one of the 18 LSCs whose data we used (listed above) and one of the remaining 35 LSCs whose data we didn’t use, it creates a bias in our data set. The main effect is to undercount swims per athlete because our data set does not include all the swims made by the athletes in our data set. The effect is exaggerated for athletes at the top of their age group (ages 10, 12, 14, 18), who are more likely to attend inter-LSC championship meets, such as Zones. The simplest way to address this bias is to exclude all swims by athletes outside the 18 data source LSCs. This causes its own problems, because swims are not reliably labeled with the athlete’s LSC.
Recognized Events. To match the USA-S SWIMS database, our data excludes 2.2 million swims from non-recognized events, such as 25 yard swims for all ages and 50 Back/Breast/Fly for 13/Overs. Excluding non-recognized events lowers participation measures for 8/Unders, who are most likely to swim 25s, and for casual 13/Overs who are more likely to swim 50s.
In the next post, we’ll explore what’s in that data, and what it says about USA-S’s age group program (link).