"Space is big" said Douglas Adams, ruminating on the scale of the universe.
This bigness is something we must face whenever we consider the application of logic to evidence (or, more classically, logos). The results we derive depend not only on the validity of our analysis but also on the choice of data: not just the direct validity of the data chosen, but the completeness of our selected subset relative to the totality of the relevant data.
But the universe of data itself is big in the Douglas Adams sense. In most problems complicated enough to be interesting - especially those relevant at a human scale - there is almost no end to the facts and figures that might have some bearing on the issue.
In contrast to the scale of the relevant data stands the scarcity of our attention. The amount of information we can consider at one time is small. This is true of blog posts, journal papers, newspaper articles, and conversations. It is less true of a book-length work, but even that can synthesize only a tiny fraction of the available evidence.
So the limited selection we can make from the large amount of evidence available is critical to the validity of our work.
Of course the above applies not only to arguments that we make for others' consumption, but to a certain extent to those we make for ourselves. While the scope of data we can consider when forming an opinion is larger than we can practically present in a cogent argument to someone else, it is still limited. Before we attempt to draw conclusions we need to be sure we've considered the wider context of the data.
A consequence is that whenever we read an argument we are placing our trust in the person making it. While we might be able to use our intelligence to follow the argument, unless we are experts in the subject matter we almost certainly don't know the evidence, so we can't see the lies of omission. I suspect we know this at some level, which explains the success of ad hominem attacks: if our basic trust in the source of an argument is undermined we stop listening, and the logic of the argument becomes irrelevant.
In medicine the effect of this cornucopia of evidence is well known. New drugs, treatments and aetiological risk factors usually undergo a plethora of clinical trials of varying quality and statistical power. Even the gold-standard randomised controlled trials often show a mixed pattern of results: some show no effect, some a negative one, and some a positive one. That's the point of a meta-analysis: it encompasses the cornucopia, or at least enough of it to reveal the shape of the rest.
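To make the pooling step concrete, here's a minimal sketch of a fixed-effect, inverse-variance meta-analysis in Python. The effect sizes and standard errors are invented for illustration, and real meta-analyses also have to grapple with study quality and heterogeneity, which this ignores.

```python
import math

# A handful of hypothetical trials as (effect estimate, standard error)
# pairs, deliberately showing the mixed pattern: negative, null, positive.
trials = [(-0.10, 0.20), (0.02, 0.15), (0.25, 0.10), (0.18, 0.12)]

# Weight each trial by the inverse of its variance, so that precise
# (well-powered) trials dominate the pooled estimate.
weights = [1.0 / se ** 2 for _, se in trials]
pooled = sum(w * eff for (eff, _), w in zip(trials, weights)) / sum(weights)
pooled_se = math.sqrt(1.0 / sum(weights))

print(f"pooled effect: {pooled:.3f}, 95% CI: "
      f"({pooled - 1.96 * pooled_se:.3f}, {pooled + 1.96 * pooled_se:.3f})")
```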
Tabloid journalism, on the other hand, goes out of its way to abuse this concept. Fact 1... fact 2... - inductive leap - verification! From a handful of emotive crimes to a broken society. The most serious flaw in this process isn't the emphasis on pathos (or ethos for the Mail) over logos, or the use of induction, but the cherry-picking of data in the first place. So there's the flippant answer at least - consider what a red-top would do, and don't do that.
But really all this raises the underlying question... how do we choose our data? I plan to explore that next time.
Wednesday, 13 January 2010
The Logic of Scientific Discovery, Part 3
And back to hanging out with Karl...
Chapter 3, Theories
This chapter heads the second section of the book, where Popper builds up the structural components of his Logic of Science. Here he discusses what a scientific theory is and how it connects to events in the knowable world.
12. Causality, Explanation, and the Deduction of Predictions
Popper defines both scientific theories and the events that they depend on as "statements" without privileging events as more concrete. His theories are universal statements, applying to all possible situations. His events are singular statements, applying to specific objects, times and places. Causality is then explained as a combination of one or more universal statements with one or more singular statements (the initial conditions) that lead to a final condition (the effect).
Popper doesn't assert that every effect has a cause, i.e. that it can be deductively explained (the principle of causality). Depending on the interpretation of the word "can" this is either an empty tautological statement (if we try hard enough we can always find a combination of initial conditions and theories that leads to the observed outcome) or an empty metaphysical one (an unprovable and unfalsifiable assumption about the nature of reality). Either way we don't gain much. Instead Popper again creates a methodological rule to deal with this, to the effect that the pursuit of science can never give up the attempt to causally explain the events we observe (this is essentially "the game never ends").
13. Strict and Numerical Universality
Popper takes a brief detour to distinguish between strictly universal statements - "such and such is true at all places and all times" - and numerically universal statements, which are only contingently true, e.g. "at the present time there are no persons taller than three metres alive on the Earth". The latter can be verified, in principle, by enumerating all examples and so can be replaced by a finite number of singular statements. The former, however, are universal assertions about an unlimited number of cases - what Popper calls an all-statement.
We take scientific laws to be composed of these strictly universal all-statements, again as a methodological principle.
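In symbols (my notation, not Popper's): a numerically universal statement ranges over a finite domain $\{a_1, \dots, a_n\}$ and so reduces to a conjunction of singular statements,

$$P(a_1) \land P(a_2) \land \dots \land P(a_n),$$

which can be verified case by case. A strictly universal all-statement,

$$\forall x \, P(x),$$

ranges over an unlimited domain, and no finite enumeration can verify it.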
14. Universal Concepts and Individual Concepts
Here Popper closely parallels the previous section in terms of "concepts" rather than "statements". It is obvious from the footnotes that this was part of a lively dialogue with other authors relevant at the time, but it now seems largely a semantic exercise. Individual concepts are those localised in some sense, definable by proper names or co-ordinates in time and space ("Napoleon", "the Earth", or "the Atlantic"). Universal concepts are those with no limits to the applicability of their definition ("dictator", "planet", or "H2O"). There are some borderline cases that may be defined either way, such as "Pasteurization": "the sterilization process invented by Pasteur" or "heated to 80 degrees C and kept there for ten minutes" - but this is just the use of a single label for two different things. Popper's main point is that a clear distinction exists and statements can be classified one way or the other.
15. Strictly Universal and Existential Statements
The previous two sections have built up to this key argument. Popper distinguishes two types of statement using only universal concepts - strictly universal statements and existential statements.
An existential statement has the form "such and such exists", e.g. "there are non-black ravens".
A strictly universal statement has the form of "such and such does not exist", e.g. "there are no non-black ravens".
These two types of statement are symmetric in the sense that the negation of a strictly universal statement is an existential statement, and vice-versa. Importantly, existential statements are verifiable, while strictly universal statements are falsifiable.
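In symbols (my notation, not Popper's), with $R(x)$ for "$x$ is a raven" and $B(x)$ for "$x$ is black": the strictly universal statement is

$$\neg \exists x \, (R(x) \land \neg B(x)),$$

and its negation is the existential statement

$$\exists x \, (R(x) \land \neg B(x)).$$

A single observed non-black raven verifies the existential statement and thereby falsifies the universal one.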
In section 6 we saw that, according to Popper's demarcation criterion, science is made up of falsifiable statements. Therefore the laws of science are phrased as strictly universal statements prohibiting certain occurrences or states of affairs. While we can observe singular events that agree with existential statements, we don't have the luxury of the infinite series of experiments it would take to verify them, so they don't have the status of scientific theories.
16. Theoretical Systems
Popper comments on axiomatised systems - a mature position for a branch of science where enough work has been done to codify the set of universal statements that determine its behaviour. If this can be done it prevents new assumptions sneaking in without explicitly falsifying some of the axiom statements. Popper lists the properties these axioms should have, in preparation for later sections about falsifiability: they must be consistent with each other, independent of each other, and both sufficient and necessary for deducing all the statements of the field.
17. Some Possibilities of Interpreting a System of Axioms
What is the nature of these axioms, though - are they merely well-corroborated strictly universal statements? That is, are they scientific hypotheses like any other, or do they have some special status?
Popper takes several pages to say, in essence, that they must simply be hypotheses, expressed as statements like any other. If they are not falsifiable - because they are assumed to be true as "self-evident", because they are conventions that themselves define whether the terms they introduce are 'allowable', or because they mix definitions of strictly universal concepts with existential elements - then they lose their empirical basis and no longer have a part to play in a scientific logic.
18. Levels of Universality. The Modus Tollens
The Modus Tollens can be summarised as "If A then B. Not B, therefore not A". Popper applies this construction to structures of hypotheses in a branch of science.
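In symbols, with $t$ standing for a system of theories and $p$ for a prediction deduced from it (roughly the schema Popper gives in this section):

$$\big( (t \rightarrow p) \land \neg p \big) \rightarrow \neg t.$$

A falsified prediction transmits its falsity back up to the theory system that entailed it.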
Seeing that our axioms are hypotheses, Popper discusses the consequences of deriving secondary hypotheses from primary hypotheses in a kind of tree-like semi-lattice structure. In principle falsification of these secondary hypotheses cascades upwards, falsifying higher-level hypotheses, with the exception of independent hypotheses on other branches of the semi-lattice. In practice it is a bit more complicated than that, which is examined in later chapters.
Chapter 3 Summary
In this chapter Popper has assembled a great deal of structure. He's characterised the elements that make up scientific theories and those that make up the evidence base that they rest upon. Causality has been slotted in as the combination of theories and evidence, though not exactly covered in depth.
We've restated a previous methodological assertion and picked up another one: scientific laws are universal. This seems appropriate - a discovery that scientific laws were different in different places or under different conditions would touch off a frenzy of investigation aimed at discovering the hidden variables that controlled the change. The working assumption would be that we were missing some higher-order theory to explain the difference in behaviour - exactly Popper's point.
Scientific theories are firmly pinned to the mechanism of falsifiability - it's only a small exaggeration to say that the two words are synonymous in Popper's work. There has also been a brief look at some of the mechanics of falsifiability of our theories, and we still have the open question of how to tie our theories to the real world. Further exploration of these two questions occupies the next three chapters of the book.
All in all this is a good gutsy chapter that builds many basic components of Popper's swamp hut.
Saturday, 9 January 2010
Analytics X-Prize, Part 1
I came across the Analytics X-Prize while reading Hacker News last week. The challenge is to predict the distribution of homicides in the city of Philadelphia over the course of 2010. It did cross my mind to wonder if this is an FBI sting of some sort*. Assuming it's legit, though, I thought I'd give it a go. Not being American, and knowing nothing about crime beyond what I've seen on The Wire, will make it an interesting case study.
The whole thing has a slightly macabre aspect I suppose, and reminds me a little of the ill-fated Policy Analysis Market. Still, the stated intention is to drive development of tools which will ultimately save lives, so I guess we're alright ethically.
Okay then, first thoughts:
- The number of homicides has decreased in recent years, from 406 in 2006 to 305 in 2009. This could be for many reasons, but it shows time dependence and implies that historic data may not be a reliable indicator of future trends.
- Narrowly defined, the brief is to predict the number of homicides in each of 47 zip-code districts. The counts per district are likely to be low, so small-number effects may be significant (see the baseline sketch after this list).
- More widely defined, the brief is to probe the causal factors of homicide. There are a few data sources suggested in the X-Prize forum, but at the moment there isn't any geo-coded data available. More than that, though, I need data about the socio-economics of Philadelphia.
- I need to do some serious background reading.
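As a placeholder for where I might start, here's a minimal baseline sketch in Python: estimate each district's share of homicides from smoothed historic counts. This is my own toy approach, not anything prescribed by the prize, and the zip codes and counts below are made up.

```python
def predicted_shares(historic_counts, alpha=1.0):
    """Laplace-smoothed estimate of each district's share of homicides.

    alpha > 0 pulls low-count districts towards the uniform distribution,
    which matters when the per-district counts are small and noisy.
    """
    total = sum(historic_counts.values()) + alpha * len(historic_counts)
    return {district: (count + alpha) / total
            for district, count in historic_counts.items()}

# Hypothetical usage with made-up zip codes and counts:
history = {"19121": 24, "19124": 17, "19103": 2}
for district, share in sorted(predicted_shares(history).items()):
    print(district, round(share, 3))
```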
Wednesday, 6 January 2010
Cookbook narratives
New Year's is the time for retrospection, introspection, and prognostication...
So, why the "Analyst's Cookbook" (apart from my love of weak puns)? What do cookbooks achieve anyway?
And why, though there seems to be a long term trend towards less time spent cooking, are there so many of them published these days (I've no links or evidence for that, just my bemused observation)? It's puzzling.
Perhaps there is a Theory X and Theory Y of cookery. Theory X 'authoritarian' cookbooks are full of didactic recipes, to be followed to the letter, which assume that people are uncreative and want the security of being told what to do. So if you are only allowed to colour between the lines there need to be plenty of colouring books. Theory Y 'participative' cookbooks de-emphasise the recipes and talk around techniques, allowing the tacit knowledge to be read between the lines. Rarer to read between than to colour between, though, in my experience.
Or perhaps this is a novice/expert dichotomy. I recall from an old Open University programme (which one precisely I've long since forgotten) on educating expert learning systems: "if asked to explain something novices will always quote rules and experts will always quote examples". Novice cooks would then need to be given a lot of rules - recipes - to follow, though seemingly with little or no effort to teach them to cook for themselves.
Though maybe it's a socio-economic thing. If changes in societal gender roles are driving the trend in time spent cooking, then simple marketing is driving the increasing number of cookbooks - perhaps buying them is another way of cooking vicariously, and naturally puts money in the pockets of the celebrity chef establishment.
Or, the ever popular all of the above. My take is that recipes are too coarse-grained. Cooking knowledge should be ingredient-based. Rather than knowing how to cook a recipe you should know how to cook single ingredients. This is in fact easier - you just need to learn to cook on the order of n ingredients by m techniques, rather than some horrendous combinatorial explosion of dishes. You can also diff the behaviour to simplify things further: root veg acts similarly in most cooking techniques, some a bit tougher (beetroot) and some a bit less so (parsnips).
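To put rough, purely illustrative numbers on that scaling argument (my arithmetic, not anyone's survey of cookery): ingredient-based knowledge grows as

$$n \times m$$

for $n$ ingredients and $m$ techniques, while dish-based knowledge grows with the number of ingredient combinations, on the order of

$$\binom{n}{k}$$

for $k$-ingredient dishes. With $n = 30$, $m = 5$ and $k = 4$, that's $150$ ingredient-technique pairs against $\binom{30}{4} = 27405$ possible ingredient sets.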
Recipes are then the map rather than the territory. Each one is a jumping-off point for your own adventures. They become as simple as "Roasted pork and potatoes, with peas", and the rest of the book is about the cascade of options and trade-offs involved in roasting root vegetables and fatty cuts of meat. When you know ingredients rather than recipes you can cook with any random assortment of leftovers. Putting together a meal from scratch becomes a small act of apt creation that you then get to eat.
We need cookbooks that tell us what cooking is about - that teach us its narratives. We need better cookbooks, not more of them.
Now search/replace "cooking" with "analysis" above. That's this blog. Hopefully.
So, why the "Analyst's Cookbook" (apart from my love of weak puns)? What do cookbooks achieve anyway?
And why, though there seems to be a long term trend towards less time spent cooking, are there so many of them published these days (I've no links or evidence for that, just my bemused observation)? It's puzzling.
Perhaps there is a Theory X and Theory Y of cookery. Theory X 'authoritarian' cookbooks are full of didactic recipes, to be followed to the letter, which assume that people are uncreative and want the security of being told what to do. So if you are only allowed to colour between the lines there need be plenty of colouring books. Theory Y 'participative' cookbooks de-emphasise the recipes and talk around techniques, allowing the tacit knowledge to be read between the lines. Rarer to read between than colour between though, in my experience.
Or perhaps this is a novice/expert dichotomy. I recall from an old Open University programme (which one precisely I've long since forgotten) on educating expert learning systems: "if asked to explain something novices will always quote rules and experts will always quote examples". Novice cooks would then need to be given a lot of
Though maybe its a socio-economic thing. If changes in societal gender-roles is driving the trend in time spent cooking, then simple marketing is driving the increasing numbers of cookbooks - perhaps buying them is another way of cooking vicariously, and naturally puts money in the pockets of the celebrity chef establishment.
Or the ever popular all of the above. My take is that recipes are too coarse-grained. Cooking knowledge should be ingredient based. Rather than knowing how to cook a recipe you should know how to cook single ingredients. This is in fact easier - you just need to learn to deal with cooking order n ingredients by m techniques rather than some horrendous combinatorial explosion of dishes. You can also diff the behaviour to simplify things further: root veg acts similarly in most cooking techniques, some a bit tougher (beetroot) and some a bit less so (parsnips).
Recipes are then the map rather than the territory. Every one is a jumping-off point for your own adventures. They become as simple as "Roasted pork and potatoes, with peas" and the rest of the book is about the cascade of options and trade-offs involved in roasting root vegetables and fatty cuts of meat. When you know ingredients rather than recipes you can cook with any random assortment of leftovers. Putting together a meal from scratch becomes a small act of apt creation, that you then get to eat.
We need cookbooks that tell us what cooking is about - that teach us it's narratives. We need better cookbooks, not more of them.
Now search/replace "cooking" with "analysis" above. That's this blog. Hopefully.