Thursday, 20 August 2015

My First Webscrape - Premier League Player Ratings 2014/15

Two things led to the creation of this visualisation. First, a couple of colleagues and I are attempting to build models predicting the outcome of every Premier League match this season, with varying degrees of success. Second, I saw Chris Love's awesome scrape of the BBC live text data. That got me thinking: if that complete mess of data can be scraped, anything can.

At first I tried to use the Alteryx Download and JSON Parse tools in a similar way to Carl at The Information Lab, but I'm a complete novice and couldn't get it to work. Definitely running before I can walk. Fortunately, I stumbled upon Data School student Hashu Shenkar's post on using Import.io in conjunction with Alteryx. I'd used Import.io before, but wasn't really aware of just how powerful it could be. Hashu's post opened my eyes to what it could do.

I wanted to scrape WhoScored.com's Player Summary data, club by club, for last season. I only covered clubs who are currently in the Premier League, so QPR, Hull and Burnley (all relegated) are excluded, while Norwich, Watford and Bournemouth are included with their Championship stats (you can highlight and filter their data if you are against comparing apples with pears). I soon realised that WhoScored isn't the easiest site to scrape, but I managed to get the Summary data after about 8 attempts. For reasons unknown to me, when I published my API and ran my block list of 20 URLs through it, some would fail. I then had an issue where the API was skipping over players who had been transferred out, such as Christian Benteke.

Anyway, with the data all downloaded, I put it into Alteryx to clean it up, get rid of repeated fields and weird characters, and organise it ready for Tableau. I'm currently evaluating Alteryx for a couple of weeks, so I thought what better way to ease myself into it? The raw data scraped using Import.io included Player Position data (as you'll see in the viz) in this kind of format: AM(RLC), FW. So I created a workflow in Alteryx to break that down into individual positions (AMR, AML, AMC and FW), with one row per player per position.
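For anyone without Alteryx, the position breakdown can be sketched in a few lines of Python. This is only an illustration of the same idea, not the workflow I actually built: the function names `expand_positions` and `explode_rows` and the placeholder player name are made up for the example, and it assumes every entry looks like the AM(RLC), FW format above.

```python
import re

def expand_positions(raw):
    """Expand a position string such as 'AM(RLC), FW' into single positions."""
    positions = []
    for part in raw.split(","):
        part = part.strip()
        if not part:
            continue
        # 'AM(RLC)' -> role 'AM' plus one letter per side: R, L, C
        match = re.fullmatch(r"([A-Za-z]+)\(([A-Za-z]+)\)", part)
        if match:
            role, sides = match.groups()
            positions.extend(role + side for side in sides)
        else:
            # No brackets (e.g. 'FW'): keep the position as it is
            positions.append(part)
    return positions

def explode_rows(players):
    """Turn (player, raw_position) pairs into one row per player per position."""
    return [(name, pos) for name, raw in players for pos in expand_positions(raw)]

# Placeholder data, not taken from the real scrape
rows = explode_rows([("Player A", "AM(RLC), FW")])
for row in rows:
    print(row)
```

Each player ends up with one row per position, which is the shape Tableau wants for filtering by position.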

Once that was done, I had two CSV files ready for Tableau Public!

There's a lot of insight to be gained from this data: all sorts of interesting little stats are hidden away, and patterns emerge pretty quickly. I particularly enjoy how Chelsea's players are split into two clusters, rating-wise. Aside from AFC Bournemouth, they are probably the only team with a clearly defined starting 11 and minimal rotation. That served them well last season, but with the increased competition in the PL and the step up for Bournemouth, is it unrealistic to expect that same approach to work for both teams this season? No doubt we'll find out.

One final point: WhoScored also has separate tabs (JavaScript, I believe) for offensive, defensive and passing statistics, but try as I might, I couldn't get Import.io to scrape from those. Any tips on how to do that would be most welcome. Enjoy the viz; it was great fun making it from start to finish.

As usual, everything is interactive so click away and see what you can find. On the second tab, I've picked out some stats that caught my attention.


2 comments:

  1. Could you please tell me how you scraped player data from whoscored.com?

  2. Thanks for the post. Same comment from me: how did you scrape the data? I am really struggling.
