Wednesday 10 February 2010

A cornucopia of evidence

"Space is big" said Douglas Adams, ruminating on the scale of the universe.

This bigness is something we need to face whenever we consider the application of logic to evidence (or, more classically, Logos). The results we derive depend not only on the validity of our analysis, but on the choice of data: not just the direct validity of the data chosen, but the completeness of our selected subset relative to the totality of the relevant data.

But the universe of data itself is big in the Douglas Adams sense. In most problems complicated enough to be interesting - especially those relevant at a human scale - there is almost no end to the facts and figures that might have some bearing on the issue.

In contrast to the scale of the relevant data is the scarcity of our attention span. The amount of information that we can consider at a time is small. This is true for blog posts, journal papers, newspaper articles, and conversations. Less so for a book-length piece of work, but even these can synthesize only a tiny fraction of the available evidence.

So the choices we make from the large body of evidence available are critical to the validity of our work.

Of course the above applies not only to arguments that we make for others' consumption, but to a certain extent to our own. While the scope of data we can consider when forming an opinion is larger than we can practically present in a cogent argument to someone else, it is still limited. Before we attempt to draw conclusions we need to be sure we've considered the wider context of the data.

A consequence is that whenever we read an argument we are placing our trust in the person making it. While we might be able to use our intelligence to follow the argument, unless we are expert in the subject matter we almost certainly don't know the evidence, so we can't see the lies of omission. I suspect we know this at some level, which explains the success of ad-hominem attacks: if our basic trust in the source of an argument is undermined then we stop listening, as the logic of the argument becomes irrelevant.

In medicine the effect of the cornucopia of evidence is well known. New drugs, treatments and aetiological risk factors usually undergo a plethora of clinical trials of varying quality and statistical power. Even gold-standard randomised controlled trials often show a pretty mixed pattern of results: some show no effect, some a negative one and some a positive one. That's the point of a meta-analysis: it encompasses the cornucopia. Or at least enough of it to reveal the shape of the rest.
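
For concreteness, here's a minimal sketch of the fixed-effect, inverse-variance pooling at the heart of many meta-analyses. The three trial results are invented for illustration, and a real meta-analysis would also have to weigh heterogeneity, quality and bias:

    import math

    def fixed_effect_meta(estimates, std_errors):
        # Weight each trial by its precision (1 / variance), then pool.
        weights = [1.0 / se ** 2 for se in std_errors]
        pooled = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
        return pooled, math.sqrt(1.0 / sum(weights))

    # Hypothetical trials: one negative, one near-null, one positive.
    effect, se = fixed_effect_meta([-0.10, 0.02, 0.25], [0.15, 0.08, 0.10])
    print(f"pooled effect {effect:.3f}, 95% CI +/- {1.96 * se:.3f}")

The mixed individual results collapse into one estimate with a tighter standard error than any single trial - a crude outline of the shape of the cornucopia.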

Tabloid journalism, on the other hand, goes out of its way to abuse this concept. Fact 1... fact 2... - inductive leap - verification! From a handful of emotive crimes to a broken society. The most serious flaw in this process isn't the emphasis on pathos (or ethos for the Mail) over logos, or the use of induction, but the cherry-picking of the data in the first place. So there's the flippant answer at least - consider what a red-top would do, and don't do that.

But really all this raises the question... how do we choose our data? I plan to explore that next time.

Wednesday 13 January 2010

The logic of scientific discovery, Part 3

And back to hanging out with Karl...

Chapter 3, Theories

This chapter heads the second section of the book, where Popper builds up the structural components of his Logic of Science. Here he discusses what a scientific theory is and how it connects to events in the knowable world.

12. Causality, Explanation, and the Deduction of Predictions

Popper defines both scientific theories and the events that they depend on as "statements", without privileging events as more concrete. His theories are universal statements, applying to all possible situations. His events are singular statements, applying to specific objects, times and places. Causal explanation is then the deduction of a further singular statement (the effect, or prediction) from the combination of one or more universal statements with one or more singular statements (the initial conditions).
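
Popper's own illustration is the breaking of a thread under a load. Rendered in first-order notation (the symbols are mine), where W(x) reads "x is loaded beyond its tensile strength" and B(x) reads "x breaks", the deduction of the prediction looks like:

    \forall x\,(W(x) \rightarrow B(x)) \quad \text{(universal law)}
    W(a) \quad \text{(initial condition for this thread } a\text{)}
    \therefore\; B(a) \quad \text{(prediction: thread } a \text{ breaks)}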

Popper doesn't assert that every effect has a cause, i.e. that it can be deductively explained (the principle of causality). Depending on the interpretation of the word "can" this is either an empty tautological statement (if we try hard enough we can always find a combination of initial conditions and theories that lead to the observed outcome), or an empty metaphysical one (an unprovable and unfalsifiable assumption about the nature of reality). Either way we don't gain much. Instead Popper again creates a methodological rule to deal with this along the lines that the pursuit of science can never give up the attempt to causally explain the events that we observe (this is essentially "the game never ends").

13. Strict and Numerical Universality

Popper takes a brief detour to distinguish between strictly universal statements - "such and such is true at all places and all times", and numerically universal statements which are only contingently true, e.g. "at the present time there are no persons taller than three metres alive on the Earth". The latter can be verified, in principle, by enumerating all examples and so can be replaced by a finite number of singular statements. The former however are universal assertions about an unlimited number of cases, what Popper calls an all-statement.

We take scientific laws to be composed of these strictly universal all-statements, again as a methodological decision.

14. Universal Concepts and Individual Concepts

Here Popper closely parallels the previous section in terms of "concepts" rather than "statements". It is obvious from the footnotes that this was part of a lively dialogue with other authors of the time, but it now seems largely a semantic exercise. Individual concepts are those localised in some sense, definable by proper names or by co-ordinates in time and space ("Napoleon", "the Earth", or "the Atlantic"). Universal concepts are those with no limits to the applicability of their definition ("dictator", "planet", or "H2O"). There are some borderline cases that may be defined either way, such as "Pasteurisation": "the sterilisation process invented by Pasteur" or "heated to 80 degrees C and kept there for ten minutes" - but this is just the use of a single label for two different things. Popper's main point is that a clear distinction exists and statements can be classified one way or the other.

15. Strictly Universal and Existential Statements

The previous two sections have built up to this key argument. Popper distinguishes two types of statement using only universal concepts - strictly universal statements and existential statements.

An existential statement has the form "such and such exists", e.g. "there are non-black ravens".

A strictly universal statement has the form of "such and such does not exist", e.g. "there are no non-black ravens".

These two types of statement are symmetric in the sense that the negation of a strictly universal statement is an existential statement, and vice-versa. Importantly, existential statements are verifiable, while strictly universal statements are falsifiable.
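
In symbols, with R for "is a raven" and B for "is black" (notation mine):

    \neg \exists x\,(R(x) \wedge \neg B(x)) \;\equiv\; \forall x\,(R(x) \rightarrow B(x))

The prohibition "there are no non-black ravens" is the law "all ravens are black", and a single verified instance of the existential statement - one white raven - falsifies it.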

In section 6 we saw that according to Popper's demarcation criterion science is made up of falsifiable statements. The laws of science are therefore phrased as strictly universal statements prohibiting certain occurrences or states of affairs. Existential statements go the other way: a single observation can verify one, but falsifying it would take an infinite series of experiments - a luxury we don't have - so existential statements don't have status as scientific theories.

16. Theoretical Systems

Popper comments on axiomatised systems - a mature position for a branch of science, where enough work has been done to codify the set of universal statements that determine its behaviour. If this can be done it prevents new assumptions sneaking in without some of the axiom statements being explicitly falsified. Popper lists the properties these axioms should have, in preparation for later sections about falsifiability: consistent with each other; independent of each other; and both sufficient and necessary for deducing all the statements of the field.

17. Some Possibilities of Interpreting a System of Axioms

What is the nature of axioms, though - are they merely well-corroborated strictly universal statements? That is, are they scientific hypotheses like any other, or do they have some special status?

Popper takes several pages to say, in essence, that they must simply be hypotheses, expressed as statements like any other. If they are not falsifiable - because they are assumed true as "self-evident"; because they are conventions that define whether the terms they introduce are 'allowable'; or because they mix strictly universal concepts with existential elements in their definitions - then they lose their empirical basis and no longer have a part to play in a scientific logic.

18. Levels of Universality. The Modus Tollens

The Modus Tollens can be summarised as "If A then B. Not B, therefore not A". Popper applies this construction to structures of hypotheses in a branch of science.
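
In Popper's terms, with t a theory (or system of theories) and p a prediction deduced from it:

    \big((t \rightarrow p) \wedge \neg p\big) \rightarrow \neg t

Observe the failure of the prediction and the theory that entailed it falls.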

Seeing that our axioms are hypotheses, Popper discusses the consequences of deriving secondary hypotheses from primary ones, in a kind of tree or semi-lattice structure. In principle, falsification of a secondary hypothesis cascades upwards, falsifying the higher-level hypotheses it was derived from - with the exception of independent hypotheses on other branches of the semi-lattice. In practice it is a bit more complicated than that, as later chapters examine.
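
Here's a toy sketch of that cascade (the class and the example hypotheses are my own invention, not Popper's). Falsifying a node falsifies the chain it was derived from, while sibling branches survive:

    class Hypothesis:
        def __init__(self, name, parent=None):
            self.name, self.parent, self.falsified = name, parent, False

        def falsify(self):
            # Modus tollens cascades upwards through the derivation chain.
            node = self
            while node is not None:
                node.falsified = True
                node = node.parent

    axioms = Hypothesis("axiom system")
    h1 = Hypothesis("secondary hypothesis 1", parent=axioms)
    h2 = Hypothesis("secondary hypothesis 2", parent=axioms)
    h1.falsify()
    print(h1.falsified, axioms.falsified, h2.falsified)  # True True False

Note that h2 survives even though the axioms it hangs from do not: its support is undermined rather than refuted, which is part of the extra complication Popper returns to.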

Chapter 3 Summary

In this chapter Popper has assembled a great deal of structure. He's characterised the elements that make up scientific theories and those that make up the evidence base that they rest upon. Causality has been slotted in as the combination of theories and evidence, though not exactly covered in depth.

We've restated a previous methodological assertion and picked up another one: scientific laws are universal. This seems appropriate - a discovery that scientific laws were different in different places or under different conditions would touch off a frenzy of investigation aimed at discovering the hidden variables that controlled the change. The working assumption would be that we were missing some higher-order theory to explain the difference in behaviour - exactly Popper's point.

Scientific theories are firmly pinned to the mechanism of falsifiability - it's only a small exaggeration to say that the two words are synonymous in Popper's work. There has also been a brief look at some of the mechanics of falsifiability of our theories, and we still have the open question of how to tie our theories to the real world. Further exploration of these two questions occupies the next three chapters of the book.

All in all this is a good gutsy chapter that builds many basic components of Popper's swamp hut.

Saturday 9 January 2010

Analytics X-Prize, Part 1

I came across the Analytics X-Prize while reading Hacker News last week. The challenge is to predict the distribution of homicides in the city of Philadelphia over the course of 2010. It did cross my mind to wonder if this is an FBI sting of some sort*. Assuming it's legit, though, I thought I'd give it a go. Not being American, and knowing nothing about crime beyond what I've seen on The Wire, will make it an interesting case study.

The whole thing has a slightly macabre aspect I suppose, and reminds me a little of the ill-fated Policy Analysis Market. Still, the stated intention is to drive development of tools which will ultimately save lives, so I guess we're alright ethically.

Okay then, first thoughts:
  • The number of homicides has decreased in recent years, from 406 in 2006 to 305 in 2009. This could be for many reasons, but it shows time dependence and implies that historic data may not be a reliable guide to future trends.
  • Narrowly defined, the brief is to predict the number of homicides in each of 47 zip-code districts. The numbers per district are likely to be low, so small-number effects may be significant (see the baseline sketch below).
  • More widely defined, the brief is to probe the causal factors of homicide. There are a few data sources suggested in the X-Prize forum, but at the moment there isn't any geo-coded data available. More than that, though, I need data about the socio-economics of Philadelphia.
  • I need to do some serious background reading.
* There's also a schlock thriller waiting to be written featuring a killer nerd stalking the streets of a major American city "adjusting" the murder rate to fit his model...
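
As a zeroth-order baseline, while waiting for better data, the obvious start is each district's historic share of the total, smoothed so that sparse districts don't get a zero. A sketch (the counts and zip codes are placeholders, not real Philadelphia figures):

    # Predict each district's share of 2010 homicides from its smoothed
    # share of recent history. Add-one smoothing keeps sparse districts
    # away from zero probability.
    historic = {"19121": 25, "19134": 31, "19103": 2}  # hypothetical counts
    alpha = 1.0

    total = sum(historic.values()) + alpha * len(historic)
    shares = {z: (n + alpha) / total for z, n in historic.items()}
    print(shares)  # shares sum to 1.0 across the listed districts

Anything cleverer will have to beat this.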

Wednesday 6 January 2010

Cookbook narratives

New Year is the time for retrospection, introspection, and prognostication...

So, why the "Analyst's Cookbook" (apart from my love of weak puns)? What do cookbooks achieve anyway?

And why, though there seems to be a long-term trend towards less time spent cooking, are there so many of them published these days? (I've no links or evidence for that, just my bemused observation.) It's puzzling.

Perhaps there is a Theory X and a Theory Y of cookery. Theory X 'authoritarian' cookbooks are full of didactic recipes, to be followed to the letter, which assume that people are uncreative and want the security of being told what to do. So if you are only allowed to colour between the lines, there need to be plenty of colouring books. Theory Y 'participative' cookbooks de-emphasise the recipes and talk around techniques, allowing the tacit knowledge to be read between the lines. Rarer to read between than to colour between, though, in my experience.

Or perhaps this is a novice/expert dichotomy. I recall from an old Open University programme (which one precisely I've long since forgotten) on expert learning systems: "if asked to explain something, novices will always quote rules and experts will always quote examples". Novice cooks would then need to be given a lot of rules - recipes - to follow, though seemingly with little or no effort to teach them to cook for themselves.

Or maybe it's a socio-economic thing. If changes in societal gender roles are driving the trend in time spent cooking, then simple marketing is driving the increasing number of cookbooks - perhaps buying them is another way of cooking vicariously, and it naturally puts money in the pockets of the celebrity-chef establishment.

Or, ever popular, all of the above. My take is that recipes are too coarse-grained. Cooking knowledge should be ingredient-based: rather than knowing how to cook a recipe, you should know how to cook single ingredients. This is in fact easier - you just need to learn order n x m combinations of n ingredients and m techniques, rather than some horrendous combinatorial explosion of dishes. You can also diff the behaviour to simplify things further: root veg acts similarly under most cooking techniques, some a bit tougher (beetroot) and some a bit less so (parsnips).
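
To put rough, invented numbers on that - say 20 ingredients, 8 techniques, and dishes of 4 ingredients each:

    n \times m = 20 \times 8 = 160 \quad\text{pairings, versus}\quad \binom{20}{4} \times 8 = 38\,760 \quad\text{distinct dishes.}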

Recipes are then the map rather than the territory; every one a jumping-off point for your own adventures. They become as simple as "roasted pork and potatoes, with peas", and the rest of the book is about the cascade of options and trade-offs involved in roasting root vegetables and fatty cuts of meat. When you know ingredients rather than recipes you can cook with any random assortment of leftovers. Putting together a meal from scratch becomes a small act of apt creation that you then get to eat.

We need cookbooks that tell us what cooking is about - that teach us its narratives. We need better cookbooks, not more of them.

Now search/replace "cooking" with "analysis" above. That's this blog. Hopefully.

Wednesday 30 December 2009

Tangerine Seams

Where I grew up "tangerines" was a collective noun for satsumas, mandarins and clementines. They were always a Christmas treat, making this a festive post :)

Of the three, satsumas were my favourite - partly because they just tasted better, and partly because they were so much easier to get into. A stiff thumb into the bottom and up the central column usually split the skin and spread the segments nicely. A twist and a tug would simultaneously separate the segments from the peel and splay them ready to be pulled out. Contrast that with mandarins, where the above procedure would kinda-sorta work but usually left the segments mashed and your hands covered in pungent peel-sweat. Tougher mandarins and virtually all clementines would resist hand-peeling completely and need either quartering with a chef's knife or the painstaking removal of dozens of fingernail-sized pieces of peel. No 10-year-old has that kind of time.

So then, why is this? What are the differences between the three?

To my mind citrus fruit break down into three relevant parts: the segments, the skin, and the pith binding them together. Each has a "strength", by which I mean something like "resistance to tearing".

Skin strength. This varies with the thickness and rigidity of the skin. I'm going to assert that it is pretty much the same for all my tangerines (though I couldn't get away with this if I were considering oranges too). So while it might set a scale for some of the behaviours, it can't explain the differences between the three, and I'll leave it out of the remainder of this post.

Segment strength. This is the integrity of any particular segment.

Pith strength. The integrity of the pith itself, and in particular its ability to bind the skin to the segments and the segments to each other (I'm assuming that these are more or less the same).

Since I slyly ditched the skin strength I only have one ratio to worry about: segment strength to pith strength. Variation in this single parameter can explain our three test cases: for segment strength >> pith strength we have satsumas, for segment strength << pith strength we have clementines, and somewhere in the middle we have mandarins.
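
Or compactly, with S for the strengths (notation mine):

    r = \frac{S_{\mathrm{segment}}}{S_{\mathrm{pith}}}: \qquad r \gg 1 \;\text{(satsuma)}, \quad r \sim 1 \;\text{(mandarin)}, \quad r \ll 1 \;\text{(clementine)}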

Essentially I'm saying that the segments (and peel) are semi-independent entities coupled together by the pith. Where the coupling is weak the segments interact weakly and are easily separated. Where the coupling is strong the peel and segments act almost like a single entity that must be cut as a whole or chipped away at with a knife. In between the two extremes is more complicated behaviour, where many solutions are equally sub-optimal. Segment strength/pith strength is the coupling constant for the system.

What I'm doing here is pretty close to a static version of a simple harmonic oscillator treatment, making the "strengths" analogous to the coefficients of the restoring acceleration (the squared frequencies).

Why am I writing about tangerines anyway? Well, I was at a medical conference a few weeks ago when I suddenly thought of tangerines. A urologist was talking about new endoscopic techniques in holmium laser surgery for prostate cancer. Modern treatments for prostate cancer involve destruction of the prostate gland in a wide variety of ways. Now, the prostate is divided into several lobes with membranes in between. The novelty of the technique the surgeon described, as far as I understood it, was that a remotely manipulated endoscopic laser enabled him to separate and cut out the lobes of the prostate. This used to be easily done in open surgery, but commonly with unfortunate collateral damage to the bundles of nerves that run through that area of the body. There was some compelling video, shot from the tip of the endoscope, showing the ease with which he could burn away the membranes and separate the lobes. You see the analogy, I'm sure. What struck me, though, is that the advantage of one technique over another is a question of how well it exploits the existing segment-to-pith strength ratio.

I was also reminded of a different conference a few years back, where I saw a presentation about the epidemiology of stomach and oesophageal cancer. ICD-10 coding lumps the upper, middle and lower parts of the oesophagus together and separates them from the cardia and the rest of the stomach. However, in terms of tissue types (which strongly affect the aetiology of cancer), the lower oesophagus is closer to the stomach than it is to the upper oesophagus. Here the physical segmentation of the body has acquired a parallel histological coding, and the choice between them is the difference between clay and chocolate.

So generalising... applying coupling theory is a nice mini-narrative for considering systems with complex inter-dependent parts, and the possible interventions in them. Happy holidays.

Wednesday 9 September 2009

The Dependency Principle

I’m taking a week’s break from consideration of Popper’s work. It seems timely to mention one of the differences between science and information analysis – in general analysts don’t do their own experiments (i.e. generate their own data), while scientists who work directly with experimental data in general do.

Information analysts tend to have their data handed to them - sometimes from a different department in their organisation, sometimes from entirely outside it. Let's pretend for now that this is the only difference between the role of experiment in the work of analysts and scientists (reserving an exploration of the others for a later post). The consequence is that the analyst lacks an overview of the quality, completeness, and coding of the data.

Let’s briefly look at a sketch of the analytical process after the data is received:

Firstly there is (hopefully!) some sort of data checking, cleaning and familiarisation.

After that we might do some pre-processing of the data to turn it from raw data into something that encodes a bit more meaning (for instance turning counts into rates, or grosses into nets).

Then we could get clever with some sophisticated multi-variate analysis, GIS, data-mining or somesuch.

After that we can maybe draw some conclusions and bounce our results off other people's work. We could bring in causality and the Reverend Bayes and start feeling pretty impressed with ourselves.
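
In skeletal code the whole thing looks something like this (the file and column names are placeholders, and I'm assuming pandas):

    import pandas as pd

    raw = pd.read_csv("data_we_were_handed.csv")  # foundation: collected by someone else

    clean = raw.dropna(subset=["count", "population"]).copy()  # check and clean
    clean["rate"] = clean["count"] / clean["population"]       # encode more meaning
    summary = clean.groupby("region")["rate"].describe()       # get clever
    print(summary)                                             # draw conclusions

Every line below the first inherits whatever the first line gives it.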

But look down… we’re standing on the fifth step of a five step structure, and the foundation was outside our control.

The name of this post draws an analogy between the structure above and a concept introduced by Iain Banks in his novel Excession. This is an unashamed but entertainingly modern space opera which has as its most engaging protagonists a bunch of artificial intelligences - Minds - and the spacecraft that effectively make up their bodies. As superhuman intelligences they are frustrated with the (to them) slow action in the real world and pace of human affairs. They spend their leisure time in self-created artificial universes beyond the ken of man. But look down…

“... if you spent all your time having Fun, with no way back to reality, or just no idea what to do to protect yourself when you did get back there, then you were vulnerable. In fact, you were probably dead... it didn't matter that base reality was of no consequence aesthetically, hedonistically, metamathically, intellectually and philosophically; if that was the foundation stone that all your higher-level comfort and joy rested on, and it was kicked away from underneath you, you fell, and your limitless pleasure realms fell with you.” - Iain Banks, Excession

No matter how clever you are or how advanced your analytical techniques, you are completely reliant on your data-collection methods and your lowly paid data-gathering staff. That's The Dependency Principle.

Wednesday 2 September 2009

The logic of scientific discovery, Part 2

Continuing on with the Popper...

Chapter 2, On the Problem of a Theory of Scientific Method

This is a short chapter which re-iterates and expands several of the points already made, emphasising the methodologically contingent quality of science.

9. Why Methodological Decisions are Indispensable

Unlike some philosophers of science, Popper doesn't see science as a logical system that can either produce provably correct results or (absolutely) prove something false - there's wiggle room both at the top and the bottom. The theory of science is itself a built thing. In this picture methodological decisions are indispensable, to sew up the loopholes (cherry-picking data, rewriting hypotheses post-hoc, and similar naughtiness).

10. The Naturalistic Approach to the Theory of Method

Here Popper spends a few pages talking down orthodox positivism (a too-brief summary: positivism is the doctrine that the legitimacy of all knowledge is based solely in personal experience). This position, popular at the time of Popper's writing, requires that a theory of science can only be based on an empirical study of scientists themselves - purely empirical, experience-based, and 'naturalistic'.

This kind of absolutism - that modes of thought outside of personal experience literally have no meaning - sounds odd to modern ears. Popper contends, more relativistically, that there may (potentially at least) be many possible sciences with or without this or that feature or methodological rule, as long as they stay free of internal contradiction. The nature of science itself is not empirically discoverable, but decided.

11. Methodological Rules as Conventions

Here Popper uses the analogy of the Logic of Science as the rules of a game, and sets out some of the more important rules, to his way of playing:

The game never ends. Hypotheses can always be falsified. It always cheers me when I see today's young punks going after gravity.

Winner plays on. Once a hypothesis has been tested against the evidence and corroborated to some extent it doesn't drop out of the game until it's falsified or replaced by another one with a wider scope (the scope of hypotheses to be defined later).

Ya gotta take ya licks. This is a kind of higher-order rule specifying that no hypothesis should be protected from the rough and tumble. To do so would make it unfalsifiable and exclude it from the business of science altogether.

These are by no means an exhaustive set of the rules proposed in the rest of the book - this chapter is just a taster. In fact Popper purposely avoids writing a science-methodology shopping list. The further rules, their nature, and the connections between them are the main subject of the rest of the book.


Chapter 2 Summary

So then, we've seen science defined methodologically. Some of its main structures have been exposed and tied into the partial logical framework constructed in Chapter 1. Popper's science is a built thing, and stands on the merit that its insights are useful.

Popper quotes:
"Definitions are dogmas; only the conclusions drawn from them can afford us any new insight" - Karl Menger
Though to what extent does science tell us about the world as it is? To what extent are our hypotheses true? Good questions both...