Need input for an analysis

besserheimerphat

Well-Known Member
Apr 11, 2006
11,485
15,329
113
Mount Vernon, WA
I finished up an R program that does some text analysis on basketball play-by-play data. The way it works right now, it "reads" all the plays between baskets, then counts how often a word or group of words appears in that section of text. It does this for the entire season. I also pull out the home and visitor scores, and calculate the score difference and the time remaining in the half. My original plan was to look at which words appear most often when the Cyclones make a run, the opponents make a run, or we're trading baskets. But what constitutes a "run"? How would you codify it? What else should I look at with this data? I can do this for prior years too, for any season with play-by-play data available at Cyclones.com.
 

MikeNighttt

New Member
Feb 1, 2016
17
0
1
44
Maybe do some 3-sigma (or whatever threshold) analysis on it: see what falls outside that many standard deviations during runs, and let the outliers guide you?
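
A rough sketch of one reading of this idea: compute some per-stretch statistic (how often a word shows up, or how much the margin changed), then flag the stretches that sit more than k standard deviations from the mean. It is written in plain Python rather than R, and the function name `sigma_outliers`, the threshold `k`, and the sample counts are all made up for illustration.

```python
from statistics import mean, pstdev

def sigma_outliers(values, k=3.0):
    """Return the indices of values more than k standard deviations from the mean."""
    mu = mean(values)
    sd = pstdev(values)  # population standard deviation
    if sd == 0:
        return []  # every value is identical, so nothing stands out
    return [i for i, v in enumerate(values) if abs(v - mu) > k * sd]

# Hypothetical counts of one word in 20 between-basket stretches (not real data).
counts = [0] * 19 + [10]
flagged = sigma_outliers(counts, k=3.0)
```

With ten or fewer values, no single value can exceed 3 population standard deviations, so this only makes sense on a larger sample (a full season, say) or with a lower `k`.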
 

kucyclone

Well-Known Member
Jan 16, 2008
2,647
128
63
Seattle
I don't know how you would code it - I haven't used R since college - but I'm thinking a good starting point is at least a 10-point margin, with fewer than 5 points allowed.

You can define this a bunch of ways, of course, depending how much data you want to look at. My definition above wouldn't catch a quick 8-0 or 11-2 run, and also wouldn't catch a long stretch of 25-8, for example.

You could also do some sort of linear function, like anything where y > 7 + 1.5x (say, y = points scored and x = points allowed).
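
The two rules above can be compared side by side. This sketch takes y as the points scored and x as the points allowed during a stretch, an interpretation chosen because it fits the 8-0, 11-2 and 25-8 examples. The function names are hypothetical and Python stands in for R.

```python
def fixed_rule(points_for, points_against):
    """At least a 10-point margin with fewer than 5 points allowed."""
    return points_for - points_against >= 10 and points_against < 5

def linear_rule(points_for, points_against, intercept=7.0, slope=1.5):
    """Run if y > intercept + slope * x (reading: y = scored, x = allowed)."""
    return points_for > intercept + slope * points_against

# The three stretches from the post, plus a 10-4 stretch for contrast,
# written as (scored, allowed).
stretches = [(8, 0), (11, 2), (25, 8), (10, 4)]
results = {s: (fixed_rule(*s), linear_rule(*s)) for s in stretches}
```

Here the fixed rule misses 8-0, 11-2 and 25-8, while the linear rule flags all three and still rejects 10-4. The intercept and slope are the knobs to tune.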

Also, if there are substitutions in the data (I assume there's not) you could do a lot of cool on/off stuff, although the value of that is probably less with this team having so few subs to begin with.
 

Goothrey

Well-Known Member
May 5, 2009
4,882
636
113
Dayton via Austin
Put it on github? Or tidy up the source data into some sort of game summary?


I was literally just looking at ESPN play by play data for the ISU/KU game in Ames.

Right now I am making a drive chart for college football scraping HTML tables on NBCsports with MATLAB. r/CFBanalysis has a poster who created a data scraper using R as well. I used R in Stat 341.


As for what else to do with your program... perhaps you could compute performance figures for every 5-minute segment, or after timeouts. Try to find trends throughout the game. For instance, this season we know ISU tends to fall apart towards the ends of games. When does it happen? Why does it happen? Perhaps those questions could be answered from the data you have scraped, and some data visualization could help here.
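
As a rough sketch of the every-5-minutes idea, the scoring plays can be bucketed by elapsed game time and the margin summed per bucket. The event layout (seconds elapsed, team label, points), the team labels, and the function name are all hypothetical, and Python is used here instead of R.

```python
from collections import defaultdict

def margin_by_segment(events, team="ISU", segment_seconds=300):
    """Net points for `team` in each fixed-length segment of the game.

    events: iterable of (seconds_elapsed, scoring_team, points).
    Returns {segment_index: net_points}; segment 0 covers 0:00-4:59 elapsed.
    Segments with no scoring are simply absent from the result.
    """
    totals = defaultdict(int)
    for elapsed, scoring_team, points in events:
        segment = int(elapsed // segment_seconds)
        totals[segment] += points if scoring_team == team else -points
    return dict(totals)

# Made-up scoring plays, not real game data.
events = [(30, "ISU", 2), (290, "OPP", 3), (310, "ISU", 3), (2399, "OPP", 2)]
by_segment = margin_by_segment(events)
```

Averaging these per-segment margins over a season would show whether the final segments are consistently the worst ones, which is one way to get at the late-game question.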
 

besserheimerphat

That's a good idea. My first time through, I said a run was occurring anytime the point differential moved the same direction three times in a row. So that catches smaller runs, but may not work right for longer stretches like a 14 to 4 run unless the baskets were spaced a certain way. I'm looking for a more generalized rule and some kind of equation like you mentioned may be the way to go.
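
The three-in-a-row rule can be sketched roughly like this. It is a Python illustration, not the actual R program; `same_direction_streaks` and the sample differentials are hypothetical.

```python
def same_direction_streaks(diffs, min_len=3):
    """Find stretches where the score differential moves the same way
    at least `min_len` times in a row.

    diffs: differential (us minus them) before the first basket and after
    each basket. Returns (first_move, last_move, direction) tuples, where
    direction is +1 (our run) or -1 (their run).
    """
    moves = [b - a for a, b in zip(diffs, diffs[1:])]
    streaks = []
    start, direction = 0, 0
    for i, move in enumerate(moves):
        sign = (move > 0) - (move < 0)
        if sign != 0 and sign == direction:
            continue  # streak keeps going
        if direction != 0 and i - start >= min_len:
            streaks.append((start, i - 1, direction))
        start, direction = i, sign
    if direction != 0 and len(moves) - start >= min_len:
        streaks.append((start, len(moves) - 1, direction))
    return streaks

# Made-up differentials: a 7-0 spurt followed by a 0-9 collapse.
spurts = same_direction_streaks([0, 2, 4, 7, 5, 3, 1, -2, 0])

# A 14-4 stretch where the opponent scores after every second basket.
spread_out = same_direction_streaks([0, 3, 6, 4, 7, 10, 8, 10])
```

In the second example the differential never moves the same way three times running, so the rule finds nothing even though the stretch is 14-4. That is the spacing problem described above.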
 

Pat

Well-Known Member
Oct 20, 2011
2,410
3,543
113
KenPom used 10-0 for some work he did, and gave good reasons. Nylon Calculus also did a post on NBA "spurtability" (not making this up) based on 7-0. Either is probably safe.
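
A minimal sketch of an N-0 style rule: scan the scoring plays in order and report every stretch where one team scores at least N unanswered points. The 10 and 7 thresholds come from the post; the function name, data layout, and sample plays are hypothetical, and this is not meant to reproduce the KenPom or Nylon Calculus methodology.

```python
def unanswered_runs(plays, threshold=10):
    """Find stretches of at least `threshold` unanswered points.

    plays: list of (team, points) scoring plays in game order.
    Returns (team, points, first_play_index, last_play_index) tuples.
    """
    runs = []
    current_team, current_points, start = None, 0, 0
    for i, (team, points) in enumerate(plays):
        if team != current_team:
            # The other team scored, so the previous stretch is over.
            if current_team is not None and current_points >= threshold:
                runs.append((current_team, current_points, start, i - 1))
            current_team, current_points, start = team, 0, i
        current_points += points
    if current_team is not None and current_points >= threshold:
        runs.append((current_team, current_points, start, len(plays) - 1))
    return runs

# Made-up scoring plays, not real game data.
plays = [("ISU", 2), ("OPP", 3), ("ISU", 3), ("ISU", 2),
         ("ISU", 2), ("ISU", 3), ("OPP", 2), ("OPP", 2)]
ten_oh = unanswered_runs(plays, threshold=10)
seven_oh = unanswered_runs(plays, threshold=7)
```

Both thresholds pick out the same 10-0 ISU stretch in this sample. On a full season the 7-0 rule would flag more stretches than the 10-0 rule.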
 

besserheimerphat

The fatigue question was what started me down this rabbit hole in the first place :smile:
 

EarthIsMan

Well-Known Member
SuperFanatic
SuperFanatic T2
Nov 23, 2014
643
1,154
93
Earth
To me a "run" is the difference in points between two teams over a specified amount of time. To compare 2 teams you could look at the difference in scoring rates. For example the last 2:39 of the second half of ISU vs. Iowa in 2015-16:

Team 1 (Iowa): (Score_B - Score_A) / (Time_B - Time_A) = (82 - 80) / (2:39 - 0:00) = 2 / 2:39

Team 2 (ISU): (Score_B - Score_A) / (Time_B - Time_A) = (83 - 74) / (2:39 - 0:00) = 9 / 2:39

Whether it counts as a run really depends on the difference in scoring rates between the 2 teams:
+ (positive) difference in scoring rate => team is on a run
= (zero) difference in scoring rate => no run for either team
- (negative) difference in scoring rate => opponent is on a run

Probably the real question is what periods of time you should use to evaluate scoring rate differentials. For example, longer periods within a game will be less sensitive to short runs. While it would generate enormous amounts of data, evaluating 30-second periods should detect runs even for the fastest-paced offenses.

I would have to think about the coding language, but in concept this is how I would look at runs.
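
In code, the concept might look roughly like the sketch below (Python rather than R). It assumes the play-by-play has already been reduced to score snapshots of the form (seconds elapsed, our score, their score); that layout and the function names are hypothetical.

```python
from bisect import bisect_right

def rate_differential(start, end):
    """Scoring-rate difference in points per minute between two snapshots."""
    t0, us0, them0 = start
    t1, us1, them1 = end
    if t1 <= t0:
        raise ValueError("end snapshot must come after start snapshot")
    return ((us1 - us0) - (them1 - them0)) * 60 / (t1 - t0)

def score_at(snapshots, t):
    """(us, them) at elapsed time t, treating the score as a step function."""
    times = [s[0] for s in snapshots]
    i = bisect_right(times, t) - 1
    return (0, 0) if i < 0 else tuple(snapshots[i][1:])

def windowed_differentials(snapshots, window=30, game_seconds=2400):
    """Rate differential for each consecutive fixed-length window."""
    rates = []
    for t0 in range(0, game_seconds, window):
        us0, them0 = score_at(snapshots, t0)
        us1, them1 = score_at(snapshots, t0 + window)
        rates.append(((us1 - us0) - (them1 - them0)) * 60 / window)
    return rates

# Last 2:39 (159 seconds) of the ISU vs. Iowa example: ISU 74 -> 83, Iowa 80 -> 82.
# Only the elapsed time between snapshots matters, so the stretch starts at 0 here.
isu_rate = rate_differential((0, 74, 80), (159, 83, 82))  # about +2.64 points per minute

# Made-up snapshots for a 90-second stretch, split into 30-second windows.
sample = [(10, 2, 0), (40, 2, 3), (70, 5, 3)]
per_window = windowed_differentials(sample, window=30, game_seconds=90)
```

A positive value means our team is outscoring the opponent in that window, a negative value means the opposite, and zero means neither side is on a run. Shorter windows react faster but produce many more values, most of them zero.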

Graphically, I think this is really helpful in "seeing" the differences in runs within a given game. I chose this game for obvious purposes.
[Attached image: Gameflow.PNG]

Hopefully that made some sense and is somewhat helpful.
 

Attachments

  • Gameflow.PNG

besserheimerphat

Mobile like. This is very helpful, thanks! I can do the actual coding, just wasn't sure how to define a run. But I knew at least CF would have ideas. I'm going to give this a shot and see how it works.