Wednesday 30 December 2009

Tangerine Seams

Where I grew up "tangerines" was a collective noun for satsumas, mandarins and clementines. They were always a Christmas treat, making this a festive post :)

Of the three, satsumas were my favourite, partly because they just tasted better and partly because they were so much easier to get into. A stiff thumb into the bottom and up the central column usually split the skin and spread the segments nicely. A twist and a tug would simultaneously separate the segments from the peel and splay them ready to be pulled out. Contrast that with mandarins, where the above procedure would kinda-sorta work but usually leave the segments mashed and your hands covered in pungent peel-sweat. Tougher mandarins and virtually all clementines would resist hand peeling completely and need either quartering with a chef's knife or the painstaking removal of dozens of fingernail-sized pieces of peel. No 10-year-old has that kind of time.

So then, why is this? What are the differences between the three?

To my mind a citrus fruit breaks down into three relevant parts: the segments, the skin, and the pith binding them together. Each has a "strength", by which I mean something like "resistance to tearing".

Skin strength. This varies with the thickness and rigidity of the skin. I'm going to assert that this is pretty much the same for all my tangerines (though I couldn't get away with this if I were considering oranges too). So while it might set a scale for some of the behaviours, it can't explain the differences between the three, and I'll leave it out of the remainder of this post.

Segment strength. This is the integrity of any particular segment.

Pith strength. The integrity of the pith itself, and in particular its ability to bind the skin to the segments and the segments to each other (I'm assuming that these are more or less the same).

Since I slyly ditched the skin strength I have only one ratio to worry about - the ratio of segment strength to pith strength. Variations in this single parameter can explain our three test cases: for segment strength >> pith strength we have satsumas, for segment strength << pith strength we have clementines, and somewhere in the middle we have mandarins.

Essentially I'm saying that the segments (and peel) are semi-independent entities coupled together by the pith. Where the coupling is weak the segments interact weakly and are easily separated. Where the coupling is strong the peel and segments act almost like a single entity that must be cut as a whole or chipped away at with a knife. In between the two extremes is more complicated behaviour where many solutions are equally sub-optimal. The ratio of pith strength to segment strength is the coupling constant for the system.
What I'm doing here is pretty close to a static version of the standard treatment of coupled simple harmonic oscillators, with the "strengths" we are talking about playing the role of the restoring coefficients (whose square roots are the natural frequencies).
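For anyone who wants the toy model spelled out, here is a minimal two-segment sketch. The symbols $k_{\text{seg}}$ and $k_{\text{pith}}$ just stand in for the two strengths, and the two-segment simplification is mine - nothing here is measured on actual fruit:

$$
\begin{aligned}
\ddot{x}_1 &= -k_{\text{seg}}\,x_1 - k_{\text{pith}}\,(x_1 - x_2)\\
\ddot{x}_2 &= -k_{\text{seg}}\,x_2 - k_{\text{pith}}\,(x_2 - x_1)\\
\omega_{\text{together}} &= \sqrt{k_{\text{seg}}}\,, \qquad \omega_{\text{apart}} = \sqrt{k_{\text{seg}} + 2\,k_{\text{pith}}}
\end{aligned}
$$

When $k_{\text{pith}}/k_{\text{seg}} \ll 1$ pulling the segments apart costs barely more than moving them together and they behave as near-independent units (satsuma); when $k_{\text{pith}}/k_{\text{seg}} \gg 1$ separating the segments is far harder than handling the fruit as a whole (clementine); mandarins sit awkwardly in between.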

Why am I writing about tangerines anyway? Well, I was at a medical conference a few weeks ago when I suddenly thought of tangerines. A urologist was talking about new endoscopic techniques in holmium laser surgery for prostate cancer. Modern treatments for prostate cancer involve destruction of the prostate gland in a wide variety of ways. Now, the prostate is divided into several lobes with membranes in between. The novelty of the technique that the surgeon described, as far as I understood it, was that a remotely manipulated endoscopic laser enabled him to separate and cut out the lobes of the prostate. This used to be easily done in open surgery, but commonly with unfortunate collateral damage to the bundles of nerves that run through that area of the body. There was some compelling video shot from the tip of the endoscope that showed the ease with which he could burn away the membranes and separate the lobes. You see the analogy I'm sure. What struck me, though, is that the advantage of one technique over another is a question of how well it exploits the existing segment-to-pith strength ratio.

I was also reminded of a different conference a few years back when I saw a presentation about the epidemiology of stomach and esophageal cancer. ICD-10 coding lumps the upper, middle and lower part of the esophagus together and separates them from the cardia and the rest of the stomach. However, in terms of tissue types (which strongly affect the aetiology of cancer), the lower esophagus is closer to the stomach than it is to the upper esophagus. Here the physical segmentation of the body has acquired a parallel histological coding and the choice between them is the difference between clay and chocolate.

So generalising... applying coupling theory is a nice mini-narrative for considering systems with complex inter-dependent parts, and the possible interventions in them. Happy holidays.

Wednesday 9 September 2009

The Dependency Principle

I’m taking a week’s break from consideration of Popper’s work. It seems timely to mention one of the differences between science and information analysis – in general analysts don’t do their own experiments (i.e. generate their own data), while scientists who work directly with experimental data in general do.

Information analysts tend to have their data handed to them. Sometimes from a different department in their organisation, sometimes from entirely outside. Let's pretend for now that that is the only difference between the role of experiment in the work of analysts and scientists (reserving an exploration of the others for a later post). The consequence is that the analyst lacks an overview of the quality, completeness, and coding of the data.

Let’s briefly look at a sketch of the analytical process after the data is received:

Firstly there is (hopefully!) some sort of data checking, cleaning and familiarisation.

After that we might do some pre-processing of the data to turn it from raw data into something that encodes a bit more meaning (for instance turning counts into rates, or grosses into nets).
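As a minimal sketch of that kind of pre-processing (the table, the column names and the per-100,000 convention below are just illustrative assumptions):

```python
import pandas as pd

# Hypothetical raw data: event counts and populations by area
raw = pd.DataFrame({
    "area": ["A", "B", "C"],
    "cases": [120, 45, 300],
    "population": [80_000, 15_000, 250_000],
})

# Turn the raw counts into rates per 100,000 population,
# which carry a little more meaning than the bare counts
raw["rate_per_100k"] = raw["cases"] / raw["population"] * 100_000
print(raw)
```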

Then we could get clever with some sophisticated multi-variate analysis, GIS, data-mining or somesuch.

After that we can maybe draw some conclusions, bounce our results off other people’s work. We could bring in causality and the Reverend Bayes and start feeling pretty impressed with ourselves.

But look down… we’re standing on the fifth step of a five step structure, and the foundation was outside our control.

The name of this post draws an analogy between the structure above and a concept introduced by Iain Banks in his novel Excession. This is an unashamed but entertainingly modern space opera which has as its most engaging protagonists a bunch of artificial intelligences – Minds – and the spacecraft that effectively make up their bodies. As superhuman intelligences they are frustrated with the (to them) slow pace of action in the real world and of human affairs. They spend their leisure time in self-created artificial universes beyond the ken of man. But look down…

“... if you spent all your time having Fun, with no way back to reality, or just no idea what to do to protect yourself when you did get back there, then you were vulnerable. In fact, you were probably dead... it didn't matter that base reality was of no consequence aesthetically, hedonistically, metamathically, intellectually and philosophically; if that was the foundation stone that all your higher-level comfort and joy rested on, and it was kicked away from underneath you, you fell, and your limitless pleasure realms fell with you.” - Iain Banks, Excession

You are completely reliant on your data collection methods and lowly paid data gathering staff, no matter how clever you are or how advanced your analytical techniques. That’s The Dependency Principle.

Wednesday 2 September 2009

The logic of scientific discovery, Part 2

Continuing on with the Popper...

Chapter 2, On the Problem of a Theory of Scientific Method

This is a short chapter which re-iterates and expands several of the points already made, emphasising the methodologically contingent quality of science.

10. Why Methodological Decisions are Indispensable

Unlike some philosophers of science, Popper doesn't see science as a logical system that can either produce provably correct results or (absolutely) prove something false - there's wiggle room both at the top and the bottom. The theory of science itself is a built thing. In this picture of science, methodological decisions are indispensable to sew up the loopholes (cherry-picking data, rewriting hypotheses post hoc, and similar naughtiness).

11. The Naturalistic Approach to the Theory of Method

Here Popper spends a few pages talking down orthodox positivism (too-brief summary: positivism is the doctrine that the legitimacy of all knowledge is based solely in personal experience). This position, popular at the time of Popper's writing, requires that a theory of science can only be based on an empirical study of scientists themselves - purely empirical, experience-based, and 'naturalistic'.

This kind of absolutism - that modes of thought outside of personal experience literally have no meaning - sounds odd to modern ears. Popper contends, more relativistically, that there may (potentially at least) be many possible sciences with or without this or that feature or methodological rule, as long as they stay free of internal contradiction. The nature of science itself is not empirically discoverable, but decided.

12. Methodological Rules as Conventions

Here Popper draws an analogy between the logic of science and the rules of a game. He then sets out some of the more important rules, to his way of playing:

The game never ends. Hypotheses can always be falsified. It always cheers me when I see today's young punks going after gravity.

Winner plays on. Once a hypothesis has been tested against the evidence and corroborated to some extent it doesn't drop out of the game until it's falsified or replaced by another one with a wider scope (the scope of hypotheses to be defined later).

Ya gotta take ya licks. This is a kind of higher-order rule that specifies that no hypothesis should be protected from the rough and tumble. To do so would make it unfalsifiable and exclude it from the business of science altogether.

These are by no means an exhaustive set of those proposed in the rest of the book - this chapter is just a taster. Though in fact Popper purposely avoids writing a science methodology shopping list. The further rules, their nature, and the connections between them are the main subject of the rest of the book.


Chapter 2 Summary

So then, we've seen science defined methodologically. Some of its main structures have been exposed and tied into the partial logical framework constructed in Chapter 1. Popper's science is a built thing and stands on the merit that its insights are useful.

Popper quotes:
"Definitions are dogmas; only the conclusions drawn from them can afford us any new insight" - Karl Menger
Though to what extent does science tell us about the world as it is? To what extent are our hypotheses true? Good questions both...

Tuesday 25 August 2009

The logic of scientific discovery, Part 1

In thinking about what this blog is for I've drifted more and more into considering questions about the production of new knowledge (then again this could just be the ultimate in yak-shaving). How does one generate knowledge from data? What are the bounds on the validity of this knowledge? To what extent is it valid to infer causal relationships linking related data?

Many systems for generating new knowledge exist. Formal and informal, specialist and general, tacit and codified. However, the scientific process is one of the best documented, has at least a significant overlap with the process of information analysis, and is highly influential in today's world. I'm therefore going to look at it first, and in depth.

The process of the scientific method has been quite a battleground in recent decades, as CP Snow's two cultures express their misunderstanding of, and disdain for, each other. The scientific method in theory, and to some extent in practice, relies on Karl Popper's The Logic of Scientific Discovery, somewhat modified by later contributors.

The approach I'm going to take is to examine in detail Popper's philosophy of scientific knowledge, then the significant modifications and other contributions, and finally to explore the current situation and its relevance to information analysis. All chapter, section and page references are to the 2002 Routledge Classics edition, written by Popper in 1935 and updated by him up to 1980.

So then... Chapter 1, A Survey of Some Fundamental Problems

This first chapter serves as an overview of the ground that Popper covers in much more detail later on.

1. The Problem of Induction

Popper firmly rules knowledge generation by inductive processes out of his scientific logic. His reasoning is that he can't see a way to justify the process of induction except by using induction (leading to an unsatisfying infinite regress) or by just assuming that it is valid, which he can't justify. So... no induction.

2. Elimination of Psychologism

The process of formulating hypotheses is ruled out of the logic of the scientific process (though they may be of some interest psychologically). Science is there to deductively test the hypotheses that are generated. Hypotheses can be generated any way you please: creative intuition, some irrational inspiration or perhaps (whisper it) by contemplating data inductively.

According to Popper, hypothesis generation is pre-match warm-up and not subject to the rules of the scientific game. The game itself concerns the process of rationally testing and eliminating those hypotheses.

3. Deductive Testing of Theories

The game then, is to deductively test the theories that we have (somehow) generated. This can proceed along four lines, a) is the theory self consistent, b) is it an empirical theory - can it make any statements about the world, c) does it repeat, add to, or contradict existing theories, and d) does it produce conclusions that contradict experimental evidence.

If the theory passes the first two tests then it is in the game and Popper's business is to provide a deductive method by which we can examine it in the light of other existing theories and against the evidence.

Here we encounter the key point that exposure to experimental evidence can never verify a theory; it can only (by contradiction) falsify it. No scientific theory can be proved right - theories can only become more and more highly corroborated by the accumulation of supporting evidence.
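In symbols, the deductive step behind falsification is just modus tollens (with $T$ a theory and $P$ a prediction deduced from it):

$$
\big( (T \rightarrow P) \land \lnot P \big) \;\Rightarrow\; \lnot T
$$

Observing $P$, on the other hand, gives us no deductive licence to conclude $T$ - inferring it anyway is the fallacy of affirming the consequent - which is why corroboration is the best we can hope for.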

4. The Problem of Demarcation

Demarcation is the boundary between metaphysical systems of knowledge (i.e. those which are knowable by pure thought - maths, logic and the like) and scientific systems of knowledge. The Problem of Demarcation that Popper addresses is how to define this boundary in the absence of induction.

Popper surveys and dismisses a range of inductivist approaches to the problem before stating that the structure of his own solution is one of convention and agreement. It is a method, rather than a principle. It is agreed upon rather than derived or proved. And we use it because it works.

5. Experience as a Method

Popper starts to assemble his method: it must be non-contradictory, representing a possible world of experience; it must not cross over into metaphysics; and it must represent our world of experience.

We do this by, yes, deductive testing of theories against empirical evidence formed from our experience of the world.

6. Falsifiability as a Criterion of Demarcation

How do we judge our hypotheses against the evidence? Popper advocates using falsifiability. We cannot, even in principle, verify our hypotheses but we can prove them false: "all swans are white" remains ever-vulnerable to the discovery of a black swan in some dry land... which immediately falsifies it.

But there are loopholes. We could try to preserve our exclusive white swan hypothesis by introducing auxiliary hypotheses ("There exists a disease which in some circumstances causes naturally white swans to turn black"), changing definitions ("of course I was referring to European swans"), or refusing to recognise the evidence ("I see no swan"). Some variations on these themes are probably familiar from real-world debates about matters of fact.

This is where the methodological aspects of Popper's theory come in: essentially, part of his definition of science is a gentleman's agreement not to indulge in this kind of malarkey.

7. The Problem of the Empirical Basis

So we can generate theories, and falsify them by comparing against empirical statements of fact... but from where do we get our statements of fact? They would seem to arise, somehow, from our perceptual experience of the world but how can we put them on a firm footing? This is the Problem of the Empirical Basis, and is addressed in the 5th chapter.

8. Scientific Objectivity and Subjective Conviction

Where then does all this leave scientific objectivity? We can't know anything for sure, and we can only dismiss proposed theories if we are convinced that people are following the methodological rules.

Popper hangs scientific objectivity on inter-subjectively testable theories - theories that anyone can test and, in principle, falsify. Science is what can be repeated between different scientists (with "scientist" defined as someone who follows the scientific method, which rules out the various forms of shenanigans).

What role do our convictions of the rightness of a theory play? None directly, which comes as no surprise to a modern ear - according to this method there is no corroboration to be had for a theory by saying "I believe it". I get that this wasn't necessarily the case at that point in the great conversation.


Chapter 1 Summary

So then, that's Popper's method in a nutshell. This is a brave piece of work.

It was surprising to me just how assembled the foundations of the scientific method are. Popper just keeps on trucking through a blizzard of problems. He rejects half of inference at the get-go and runs with deductive logic alone. We lose the verification of theories: all we know is what we don't know. We reach for objectivity from a basis of subjective experiences, and - by necessity - embrace methodological rules to assemble the results allowing us to compare our theories against the world. The surprise, and the delight, is that after all of this we still find something that works.

To quote from later in the book:
"Science does not rest upon solid bedrock. The bold structure of its theories rises, as it were, above a swamp. It is like a building erected on piles. The piles are driven down from above into the swamp, but not down to any natural or 'given' base; and when we cease our attempts to drive our piles into a deeper layer, it is not because we have reached firm ground. We simply stop when we are satisfied that they are firm enough to carry the structure, at least for the time being." - Karl Popper

Wednesday 1 July 2009

No moving parts

May was not a good month for computer hardware at my house. First to go was an external hard disc drive, half a terabyte on its way to the information hereafter. Then my increasingly flaky 2-year-old laptop BSOD'd itself to death in a spasm of Tourette's-like non sequiturs. Finally my venerable 6-year-old desktop shuffled away to join them in silicon heaven. So either one or two HDD failures (beware the IDEs of May!), or one HDD and a CPU, plus a case of general senescence. No personal data lost, but a not-inconsiderable amount of hassle and expense.

So then... this time it will be different. After a bit of a think about what I actually use my boxes for at home I ordered 1±1 netbooks from Dell.* No CD/DVD, no HDD, no fan, and the disc is replaced by an SSD. No Moving Parts. And I ordered them with Ubuntu Linux.

Windows is the ultimate moving part. The mean time to effective failure, i.e., the point at which you switch it off and on to make it behave, is maybe two or three days. The mean time to total failure (reinstall the OS) is maybe 18 months.

Now I'm no slouch with Windows. I've installed nearly every version from 3.1 to XP (I skipped ME) and have been the go-to guy for Windows problems amongst friends, family and colleagues for most of that time. For a couple of years, in my spare time, I supported a network of 6 Win2K boxes at a lab I worked in. Not professional support work perhaps, but the next best thing. For me RegEdit holds no fear.

So why am I binning all that experience? Not just with the OS but with all the OS-specific applications. There are actually a myriad of reasons, but just to pick one... it's the lack of transparency. As touched upon in last week's post, it's the sense of opacity. Your understanding can proceed this far and no farther. After that it is black box built upon black box. Learning how to work in this environment isn't real learning, it's voodoo, it's mysticism. It's shaping your expectations to the arbitrary failings of a Chinese room with ADHD. It's turn three times and spit to avoid the wrath from high atop the thing... There are times that, I swear, if I could have sacrificed a chicken over a Windows box to make the damn thing work I would have done so with no compunction.

To be honest though, this isn't the first time I've installed Linux. Back in 2003 I installed Red Hat as a dual boot on my desktop box. It was a real pain to do and I never managed to get the graphics drivers working right. Consequently the screen flicker would give me a blinder of a headache after about 30 minutes. I used it less and less frequently and in the end didn't bother restoring it after the next sesquiannual Windows reinstall. Both by reputation and in my experience (albeit 6 years ago), Linux has a vicious learning curve.

But this time I planned for that. Buy a system with it pre-installed and the learning curve looks like this
instead of like this

To be fair to the Ubuntu guys though the thing did work out of the box. Easier, I'd have to say, than the typical windows install. Getting onto my home wifi was a two-click process. A leftover 3-button mouse and (more impressively) my Intuous-2 graphics tablet worked on plugin with no software jiggery-pokery necessary.

Of course I kept fiddling with things until I broke something. Before I quite grokked the escalate-to-admin privilege paradigm I created an admin account called "admin". This seemed to be a reserved word and resulted in me suddenly having no accounts with admin privileges. Strange choices in the boot configuration by Dell made this harder to fix than it need have been, and made my actual learning curve look like this



but we got there in the end.

Am I happy with Ubuntu Linux? Oh yes. The GUI is slick, professional, and easy to use. Going back to the command line underneath it was something like returning to the UNIX labs of my youth. Pipes and man pages. Chmod and grep and awk and sed. Really though the reasons are the inverse of those for bailing on Windows. The sense of transparency and trustworthiness, of not being played for a sucker. Finally, finally, leaving the voodoo tribe.


* I order one. Dell cancel the order. I order another. Dell uncancel the first order and promise to deliver both. The idea of two netbooks to play with wasn't entirely unpleasant so I just rolled with it at that point: "Order from Dell and hope like hell".

Wednesday 24 June 2009

To Excel or not to Excel?

Microsoft software product names are a pretty good source of ironic amusement. Of those I use regularly, "Excel" and "Access" vie for the top spot, which is probably why I found the paper by David Lawunmi that turned up on the HSUG mailing list a week or two ago so interesting.

Lawunmi uses a wealth of interesting references to tour some of the drawbacks of Excel. These include - without rehashing his entire article - a variety of statistical flaws or missing functions, use of binary rather than decimal floating-point arithmetic, and generally appalling charting. More damningly to my mind though is the lack of record keeping involved in analytical spreadsheet work. Generally at the end of a piece of Excel work you’re left with your raw data file, your results, a mish-mash of semi-processed data, and a grim smile. There’s no audit trail or log of what you’ve done, and your results can’t be replicated by anyone else (or often yourself). Not that using spreadsheets makes good practice impossible of course, but they certainly do discourage it.
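To make the binary floating-point issue concrete, here's a generic sketch (nothing Excel-specific - any binary floating-point system behaves the same way):

```python
from decimal import Decimal

# 0.1 and 0.2 have no exact binary floating-point representation,
# so their sum is not exactly 0.3
print(0.1 + 0.2 == 0.3)                                   # False
print(Decimal(0.1 + 0.2))                                 # 0.3000000000000000444...
print(Decimal("0.1") + Decimal("0.2") == Decimal("0.3"))  # True with decimal arithmetic
```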

The analogy between programming projects and spreadsheets that Lawunmi draws from the NPL document is pretty close, but a tighter one is to the subset of scripting languages within the set of programming languages. Quick and dirty solutions to small problems, and a method of making large problems functionally insoluble. However, friends who work in the software biz tell me that wherever there is an interface between large proprietary systems you will find a nest of script jobs providing the cushion and the interface. And of course we live in the real world. Excel is ubiquitous and we will all continue to use it:

“Excel is utterly pervasive. Nothing large (good or bad) happens without it passing at some time through Excel”
Source: Anonymous interviewee quoted by Croll.

So where do we go from here?

You’re going to use Excel for many (most?) things but make the effort to supersede it where it counts – don’t get sucked into software monolingualism. For proper stats use a proper stats package. For proper graphs use a proper graphical package. When you do use Excel be careful, and if you can't be careful be lucky. Or perhaps you could move to Open Office. Hmmm, must check that out...

In the final analysis perhaps the question should be rewritten wryly: “To excel when Excel doesn’t?”

Wednesday 17 June 2009

Generating Principles for Analysts, Part 1

I was reading the Overseas Development Institute's Exploring the Science of Complexity today (more on this again), which I found through Duncan Green's blog From Poverty To Power.

One of the examples it gave of the generation of complex behaviour from simple rulesets was that of the US Marine Corps who apparently use "Capture the high ground", "Stay in touch", and "Keep moving" (I haven't managed to confirm this, though the principles are certainly in keeping with the maneuver warfare principles of John Boyd).

Regardless, the question is begged... what might these be for analysts?

I think I got a line on the first one a few weeks ago when I was facilitating a workshop about the main points in Tufte's books. He spends a fair amount of time in The Visual Display of Quantitative Information on how graphs are made up of "data ink", which conveys information, and "non-data ink" (including chart junk), which doesn't. One of his examples takes this original graph (Linus Pauling, General Chemistry (San Francisco, 1947), p. 64),


Source: The Visual Display of Quantitative Information, Edward Tufte

and with the non-performing non-data ink removed (and just a little bit of data ink added), we have

Source: The Visual Display of Quantitative Information, Edward Tufte

Much better - reads at a glance.

It occurred to me that the concept of data ink versus non-data ink is another way of expressing the signal-to-noise ratio.
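If I remember the book rightly, Tufte even makes this quantitative with his data-ink ratio, which reads very like a signal-to-noise measure:

$$
\text{data-ink ratio} = \frac{\text{data-ink}}{\text{total ink used to print the graphic}}
$$

Maximising that ratio is exactly what the redrawn Pauling chart above does.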

Errors are a form of noise. Lack of clarity is noise. Extraneous information is noise.

So... Maximise Signal:Noise as a first generating principle. Not bad.

Wednesday 10 June 2009

"A city is not a tree"

Christopher Alexander's 1965 essay (Part I & Part II, and also Part I with all diagrams intact) has a great title, but deserves its status as a classic for its ideas. It argues that the elements of a city - the people, the buildings, the businesses - and their interrelations are not well represented by a mathematical tree structure but much better by a mathematical semi-lattice. The complex interrelations between these elements are much better captured by a structure that allows them to overlap.
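A minimal sketch of the formal distinction as I understand it from the essay (the toy "city units" below are invented for illustration):

```python
from itertools import combinations

def is_tree(collection):
    """Alexander's tree condition: any two sets in the collection are
    either disjoint or one wholly contains the other - no partial overlaps."""
    for a, b in combinations(collection, 2):
        if a & b and not (a <= b or b <= a):
            return False
    return True

# A tree-like decomposition: neighbourhoods nested neatly inside a district
tree_city = [frozenset({1, 2, 3, 4}), frozenset({1, 2}), frozenset({3, 4})]

# A semi-lattice: a school catchment and a shopping street share element 3
# without either containing the other
semilattice_city = [frozenset({1, 2, 3}), frozenset({3, 4, 5})]

print(is_tree(tree_city))         # True
print(is_tree(semilattice_city))  # False - the units overlap
```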

Isn't this obvious? To me at least, only in hindsight. When I first read it I had one of those "Ah-Hah!" moments as the idea struck me. Of course it had to be that way, except I'd somehow never known it.

And to this day tree-patterned suburbs continue to be built around many urban cores, which suggests either that it isn't obvious to the planners, or that Alexander was also correct in advancing the idea that people have a cognitive barrier to representing concepts in semi-lattice structures. Another cognitive problem to look out for, I guess...

So, what has any of this got to do with information analysis? The lesson I get out of this is that I need to get the analytical framework right before I can hope to get anything useful out of my data, and any work I do based on the wrong framework is worse than useless.

And wider than that? That the systems that interest us most are richer and more complex than we find it easy to conceive. That seems like a good place to leave this first post.