Motor racing, as a sport, is statistics-adjacent. In the same way that a baseball game, or an entire series, can be recreated from the key measures of its players, the story of a race can be told almost entirely from the statistics. The ups and downs and emotional highs of a race, of course, can’t quite be quantified (yet), but the image of a driver, of their personality and strategy and legacy, is told in the half-float entries of a CSV.
Despite this, of course, the Formula Student competition publishes each year’s results as a PDF, a format which I would say is almost hostile to any kind of analysis. And if it’s hostile to Excel-spreadsheet-accountant-type data analysis, it is outright aggressive towards any attempt to do data science with it.
Extracting the data from the PDF, which is already in tabular format, is actually quite easy with Camelot. I honestly cannot recommend this package enough; I spent far too much time trying to clean up last year’s data, which I had extracted with PdfPlumber. Once we have a CSV we can do as we’d like with the data. I wrote up a small analysis of last year’s competition that I would like to revise, and will probably push out on here soon (tm).
What follows is a largely unnecessary exercise in data engineering; there’s not enough data to necessitate a whole system to distribute it. It would be just as easy to send someone CSVs, but what the hell, Rust is fun. The goal from the outset was to have an API that would take care of the more tedious work of inner joins and key searches, and provide the end user with an efficient format to work with. The final state of this project is something that I am pretty happy with, even if getting there was very messy. I wanted to keep the codebase lean and reduce the amount of boilerplate necessary, which ruled out most ORMs for me. The most natural representation of the data is with each event (autocross, acceleration, endurance, etc.) as its own table in the database, since each event has its own fields and measures. That means an ORM would create a different struct for every single event and then, in the Rust code, force an inordinate number of very similar-looking `match` statements. Is this really an issue? No, not really, it just looks ugly. Thus begins the somewhat painful journey from Diesel to SQLx to SeaORM to whatever else I could find. I frankly don’t even remember some of the experiments I made in forks that were never pushed.
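To make the objection concrete, here is a hypothetical sketch of the per-event shape an ORM would push me towards; the struct and field names are made up for illustration, not the real schema. One struct per results table, and a near-identical `match` arm for every operation:

```rust
// Hypothetical per-event structs; names and fields are illustrative only.
struct AccelerationRow {
    team: String,
    best_time_s: f64,
}

struct EnduranceRow {
    team: String,
    laps_completed: u32,
}

// Every event needs its own variant...
enum EventRow {
    Acceleration(AccelerationRow),
    Endurance(EnduranceRow),
    // ...and so on for autocross, skidpad, efficiency, etc.
}

// ...and every operation becomes another near-identical match.
fn describe(row: &EventRow) -> String {
    match row {
        EventRow::Acceleration(r) => format!("{}: {:.2} s", r.team, r.best_time_s),
        EventRow::Endurance(r) => format!("{}: {} laps", r.team, r.laps_completed),
    }
}

fn main() {
    let row = EventRow::Acceleration(AccelerationRow {
        team: "Example Racing".into(),
        best_time_s: 4.20,
    });
    println!("{}", describe(&row));
}
```

Multiply that by a dozen events and every new query, and the boilerplate adds up fast.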
What I wanted was to be able to handle everything generically and quickly. A column or row from the acceleration results table is not essentially different from a column or row from the endurance results table. I also wanted to be flexible, all of which (and this is mostly a skill issue, I think) was pretty difficult with an ORM. So instead I opted to try DuckDB and to communicate with my client using the Apache Arrow format. Which, honestly, was only a little painful! Critically, DuckDB allows the user to treat rows generically, and simply treat every request as resulting in a `RecordBatch`.
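As a minimal sketch of what I mean (assuming the duckdb crate with its Arrow support; the table and columns here are stand-ins, not the real schema), any table can be queried through the same code path and inspected via the batch’s own schema:

```rust
use duckdb::{Connection, Result};

fn main() -> Result<()> {
    let conn = Connection::open_in_memory()?;

    // Stand-in table; the real database has one table per event.
    conn.execute_batch(
        "CREATE TABLE acceleration (Team VARCHAR, Best_Time DOUBLE);
         INSERT INTO acceleration VALUES ('Example Racing', 4.2);",
    )?;

    // The same two lines work for any event table: no per-event struct needed.
    let mut stmt = conn.prepare("SELECT * FROM acceleration")?;
    let batches: Vec<_> = stmt.query_arrow([])?.collect();

    for batch in &batches {
        // The schema travels with the data, so the caller can introspect it.
        println!("{:?}", batch.schema());
        println!("{} rows x {} columns", batch.num_rows(), batch.num_columns());
    }
    Ok(())
}
```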
Now to the code, which is really (and proudly) simple. Practically, we define a `UserRequest` struct, implement parsing (`from_hash`) and `handle` methods for it, and use that information to query DuckDB:
```rust
pub struct UserRequest {
    pub team: String,
    pub year: String,
    pub event: String,
}

// ...

// Pull the required fields out of the parsed request parameters.
fn from_hash(args: &mut HashMap<String, String>) -> Result<Box<Self>, ParseError> {
    let team = args.remove("team").ok_or(ParseError::Missing("team"))?;
    let year = args.remove("year").ok_or(ParseError::Missing("year"))?;
    let event = args.remove("event").ok_or(ParseError::Missing("event"))?;
    Ok(Box::new(Self { team, year, event }))
}

// Each event is its own table, so the event name selects which table to query.
async fn handle(self, conn: duckdb::Connection) -> Result<Vec<arrow::array::RecordBatch>> {
    let query: String = format!("SELECT * FROM {} WHERE Team = '{}'", self.event, self.team);
    let mut stmt = conn.prepare(&query)?;
    let rbs = stmt.query_arrow([])?.collect();
    Ok(rbs)
}
```
Then, from our server handler, we can easily leverage those functions to generate the binary package:
```rust
match (req.method(), req.uri().path()) {
    (&Method::GET, "/") => Ok(Response::new(full("GET the /team/year/event"))),
    (&Method::GET, "/event_arrow") => {
        let mut request = parse_request(req).await?;
        let response = UserRequest::from_hash(&mut request)?.handle(pool).await?;

        // Serialize the RecordBatches into an in-memory Arrow IPC stream.
        let mut buf = Cursor::new(Vec::new());
        {
            let mut writer =
                arrow::ipc::writer::StreamWriter::try_new(&mut buf, &*response[0].schema())?;
            response.iter().for_each(|x| writer.write(x).unwrap());
            writer.finish()?;
        }

        // The raw bytes become the response body.
        Ok(Response::new(full(buf.into_inner())))
    }
```
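On the other end, decoding that payload is just as short. Here is a hypothetical client sketch (assuming the reqwest and arrow crates; the host, port, and query-string format are made up for illustration) that turns the IPC stream back into `RecordBatch`es:

```rust
use arrow::ipc::reader::StreamReader;
use std::io::Cursor;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Hypothetical URL; the real host, port, and parameter format may differ.
    let bytes = reqwest::blocking::get(
        "http://localhost:3000/event_arrow?team=Example%20Racing&year=2023&event=acceleration",
    )?
    .bytes()?;

    // StreamReader is the counterpart of the server's StreamWriter.
    let reader = StreamReader::try_new(Cursor::new(bytes), None)?;
    for batch in reader {
        let batch = batch?;
        println!("{} rows, schema: {:?}", batch.num_rows(), batch.schema());
    }
    Ok(())
}
```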
From here, I anticipate adding a couple of things. Firstly, I need to add the rest of the years to the database, which I’ve put off primarily because of tedium. Then, I’d also like to give the user some more convenient endpoints, like, for example, the option to sample all years of a team’s performance for a specific event, or even a `summary` endpoint which would give the user some convenient statistics. Finally, just to finish it off, I’d like to add Swagger documentation for this API, and maybe spend a weekend cobbling together a front-end exhibition for this.