A View from the Keyboard: October 2015

Saturday, 31 October 2015

Python Weekly # 3 : Incremental development

Incremental Development

or how a long journey should start with a single step

Unlike most of the other articles in this blog, this article really isn't about python, but the language does lend itself to this style of project.

When you start a software project, it is very easy to dive straight in and start coding - with a good idea of the end goal, but maybe with not much of an idea of how you get from an empty file to the final application. I have made exactly that mistake myself - and my Development folder is littered with half-started "big" projects.

It is much more efficient to take small steps, coding, testing, debugging and then repeating until finished. This is Incremental Development, and there are a few good reasons why it works - and works well :

You know that each step works - so when you start your next step you are not introducing bugs on top of bugs.
You might well surprise yourself and finish early - as you realise you have got to a point where your application is good enough, and you don't need all the bells and whistles you thought you might need.
You get a powerful sense of achievement - each cycle you know your code works and you are one more step along your journey - a bit like seeing distance markers on a highway.

With Incremental Development you follow a few simple rules :

Your cycle (design, code, test, debug) should be no more than a couple of weeks long - and if possible even shorter.
Target a small number of features on each cycle - maybe only one or two features to begin with.
Start your cycle by working out how you will test your feature - and if you can start by writing your test cases first.
At the end of the cycle - make sure your latest working code is saved in some form - so if the next cycle goes horribly wrong - you can always go back. The best way to do this is to use a formal change control tool such as git, mercurial or subversion.
Within your cycle complete one feature before you do the next one - that way you have progress made even if the 2nd feature doesn't complete.
If a feature proves too complex, drop it from the cycle - and take time to refine the design or split it up by redesigning it in a brand new cycle.
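As a rough illustration of the "write your test cases first" rule, here is a minimal sketch using the standard unittest module for a single small feature (recording an optional title for an image). The Photo class and its set_title method are hypothetical names invented for this example, and the code uses Python 3 syntax.

```python
import unittest


class Photo(object):
    """Hypothetical class: one image waiting to be uploaded."""

    def __init__(self, filename):
        self.filename = filename
        self.title = None          # the title is optional

    def set_title(self, title):
        """The single feature for this cycle: record an optional title."""
        self.title = title.strip() or None


class TestPhotoTitle(unittest.TestCase):
    """Written before set_title existed; the cycle ends when these pass."""

    def test_title_defaults_to_none(self):
        self.assertIsNone(Photo("beach.jpg").title)

    def test_title_is_recorded(self):
        photo = Photo("beach.jpg")
        photo.set_title("  Sunset  ")
        self.assertEqual(photo.title, "Sunset")

    def test_blank_title_is_treated_as_missing(self):
        photo = Photo("beach.jpg")
        photo.set_title("   ")
        self.assertIsNone(photo.title)
```

Saved in a file, the tests can be run with "python -m unittest" followed by the module name. Once they pass, that is a natural point to commit the code and start the next cycle.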

The key to this process is the concept of a feature, which is best defined as follows :

A single piece of functionality which can be described by a simple sentence. One thing your software will do in order to achieve the whole aim.

This is best explained by an example :

One of my first home projects was an application to upload one or more images to flickr. To achieve this the application needed to do a number of things including :

Connect to a flickr account
Get a list of Sets from flickr (a set is a group of images)
Upload an image to flickr
Add an image to a set
Record an optional title for the image
and many more (as you might imagine)

The application also included a GUI which needed to be developed which included a number of different elements.

Any one of those bullet points (and any one of the GUI elements) could be described as a "feature" - don't treat the GUI as a single feature - even with very good design tools, GUIs can be very complex, and maybe could be left to last.

Don't be tempted to make your features too complicated; if a feature seems too easy (for example recording a title against an image) you will be able to develop more than one feature in a cycle. If you start trying to combine items and call the combination a feature - you run the risk of something being far more complex, and you end up with a bigger failed feature.

Language features in python such as classes, modules and packages make it very easy to practise this style of development. Many features can be added as either new methods or new classes and since the language itself does not have a complex syntax which requires a lot of extra punctuation, it is relatively painless to write new code quickly. The python interpreter is also a massive benefit - giving you the chance to test new code quickly, even without full test scripts.

To summarise - start slowly in small steps, and test as you go along - don't try to do everything at once. Don't be afraid to go backwards.

Thursday, 29 October 2015

Python gotchas # 2 : Names, dynamic typing, is and ==

Python gotchas - Names vs Variables

or how not to get caught out when you start using python

At first glance python seems very familiar, especially if you have used other procedural languages such as C or Java - but actually Python is different - in some cases very different, and those differences can trip you up as you progress along your python journey. In this series of occasional posts, I am going to cover some of those gotchas.

Names and variables

In most other languages a variable holds a specific value, and a change to the variable changes the value that variable holds. That is true to a certain extent in Python, but critically there are some differences. Python doesn't have variables in the truest sense of the meaning - it has names and objects.

Everything in python is an object - numbers, strings, lists are all objects, as are the items in the list for instance, and names are simply references to those objects. (Classes and functions are also special types of object - I will cover that in more detail in a later blog post).

This process of making a name refer to an object is called binding, and a name is said to be bound to an object.

Take the following code :

################ Example 1
>>> var1 = 3
>>> var2 = var1
>>> id(var1) == id(var2)
True

This gives us two names (var1 and var2) both bound to the same object (the id function returns the identity of the object which the name is bound to). There is an int object with the value '3' to which both var1 and var2 are bound.

We can bind var1 to a different integer object - without affecting what var2 is bound to.

################ Example 2 
>>> var1 = 3
>>> var2 = var1
>>> id(var1) == id(var2)
True
>>> var1 = 4
>>> id(var1) == id(var2)
False
>>> print var1, var2
4 3

When you bind a name to an object the rules are simple - nothing is copied, the name is simply created and made to refer to that object - in terms of the C language - the name is a pointer to an object. The big difference is that in Python, since every name is a reference, you don't need a special syntax element to get to the object being referred to.

When we increment our number :

################ Example 3
>>> var1 = 3
>>> var2 = var1
>>> id(var1) == id(var2) # bound to the same object ?
True
>>> var1 = var1 + 1
>>> id(var1) == id(var2) # bound to the same object ?
False
>>> print var1, var2
4 3

Here the object itself is not changed : the expression var1 + 1 creates a new object, and var1 is then bound to that new object, while var2 is left bound to its previous object.

So what happens when our names are bound to a list object, and we change the list - do we get two list objects :

################ Example 4
>>> l1 = [1,2,3,4]
>>> l2 = l1
>>> id(l2) == id(l1) # bound to the same object ?
True
>>> l1[0] = 4
>>> id(l2) == id(l1) # bound to the same object ?
True
>>> print l2
[4, 2, 3, 4]

We can see that even after the change, the l2 name is still bound to the same object as l1, and therefore the change made to the object is reflected in both l1 and l2.

The difference between example 3 and example 4 is entirely down to whether the object is immutable : lists (and other data structures such as dictionaries) are considered to be mutable - i.e. they can be changed without the creation of a new object, and therefore a change in the object is reflected across all the names bound to it.
In the case of integers and a number of other types, the objects are considered "immutable" (i.e. cannot be changed) and attempting to change the value (adding 1 to var1 in example 3) results in the creation of a new object - in this case a new integer object (value 4) which var1 is then bound to.
(In fact when the standard CPython interpreter starts up, integer objects already exist for all values between -5 and 256 inclusive - as these values appear most regularly across most types of program, and having these objects ready saves time as your program runs).

So far the end result is probably similar to what you might expect : after all you can't redefine 3 to be 4, but you can change the content of a list.

The main tripping points are that for the mutable types there are many different ways to change the object - and some don't result in the reflection one might expect :

>>> ################ Example 5
>>> l1 = [1,2,3]
>>> l2 = l1
>>> id(l1) == id(l2)
True
>>> l1.append(4)   # append changes the object 'in place'
>>> id(l1) == id(l2)
True
>>> print l1, l2
[1, 2, 3, 4] [1, 2, 3, 4]
>>> l1+=[5]       # this form of list addition also operates 'in place'
>>> id(l1) == id(l2)
True
>>> print l1, l2
[1, 2, 3, 4, 5] [1, 2, 3, 4, 5]

>>> ################ Example 6
>>> l1 = [1,2,3]
>>> l2 = l1
>>> id(l1) == id(l2)
True 
>>> l1 = l1 + [4] # This form of change creates a new object
>>> id(l1) == id(l2)
False 
>>> print l1, l2
[1, 2, 3, 4] [1, 2, 3]

In example 6 - a new object is created (when the expression l1+[4] is evaluated), and since we have a new object, l2 is no longer bound to the same object as l1 - i.e. the reflection is broken.

The key phrase here is "in place". Many of the standard methods modify the object "in place" - i.e. without the creation of a new object - and this type of modification will always ensure that changes are reflected through all bound names. There are equally many ways to apparently change the value of a mutable object which in fact result in the creation of a new object. If in doubt you can always open your python interpreter and try it out - remember you can check the id values (or use the is operator discussed below).
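To make the "in place" distinction concrete, here is a small sketch that applies four common list operations and reports whether the result is still the original object. The helper function names are made up for this illustration, and the code uses Python 3 syntax (print as a function).

```python
def same_object_after(operation):
    """Apply `operation` to a fresh list and report whether the list it
    returns is still the original object."""
    original = [3, 1, 2]
    result = operation(original)
    return result is original


def sort_in_place(items):
    items.sort()            # list.sort() modifies the existing object
    return items


def sort_to_new(items):
    return sorted(items)    # sorted() builds and returns a new list


def extend_in_place(items):
    items.extend([4])       # same effect on the object as items += [4]
    return items


def concatenate(items):
    return items + [4]      # + builds a new list


for operation in (sort_in_place, sort_to_new, extend_in_place, concatenate):
    print(operation.__name__, same_object_after(operation))
```

The two "in place" operations report True and the other two report False, which is the same split seen between Example 5 and Example 6.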

Names and dynamic typing

Unlike in C where you have to declare what type a name/variable is before you use it, a name in Python can be bound to any object you want, and be bound to a different type of object later on - this is called dynamic typing - and is one of Python's greatest benefits - var1 can be an integer, and then a floating point number, and then a string. With that flexibility comes great strength, and also potential pitfalls.

In a complex application without clear boundaries it is easy for the developer to lose track of what type a variable should be at a given point, and to try to do something which either results in an error or, even worse, a subtle bug, because for instance the value that should be an integer is actually a string :

>>> ################ Example 7
>>> var1 = 3
>>> print "Final value %s" % (var1*3)
Final value 9
>>> ......
>>> ...... # Some time later
>>> var1 = "xxx"
>>> print "Final value %s" % (var1*3)
Final value xxxxxxxxx

I strongly suggest that in any complex program you use sensible names for your values (not var1, var2 etc) - this naming will help prevent some of the worst of the challenges that Dynamic Typing can bring.

There are also methods of testing what type of object a name is bound to, I will cover that in a future blog post.
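As a brief taste of such a check, here is a minimal sketch using the built-in isinstance function to catch the mistake from Example 7 at the point where the value is used. The function name triple is invented for this illustration, and the code uses Python 3 syntax.

```python
def triple(count):
    """Return count * 3, refusing anything that is not an integer."""
    if not isinstance(count, int):
        raise TypeError("count must be an int, got %s" % type(count).__name__)
    return count * 3


print("Final value %s" % triple(3))

try:
    triple("xxx")
except TypeError as error:
    print("Caught: %s" % error)
```

Whether an explicit check like this is worthwhile depends on the program; it turns a subtle bug into an immediate error, but it also gives up some of the flexibility that dynamic typing offers.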

And finally - a warning about 'is' and '=='

Many python beginners get confused about when to use the 'is' comparison and when to use '=='.

The '==' operator should be used when you are testing whether the two names have the same value - i.e. the objects they are bound to have the same value - for lists for instance this is whether the two lists have the same content in the same order - it is highly likely that you will use '==' far more often than you use 'is'.

The 'is' operator should be used when testing whether two names refer to the same object.

For trivial examples - integers between -5 and 256, and short string literals (less than 20 characters) - 'is' and '==' will usually return the same result (due to optimisations such as the one already mentioned), which can lead you into a false understanding of what the operators do. Using values outside those ranges will show the distinction :

>>> ################ Example 8
>>> a = 10*100
>>> b = 1000
>>> a is b        # bound to the same object ? equivalent to id(a) == id(b)
False             # NO
>>> a == b        # But definitely the same value
True
>>> l1 = [1,2,3,4]
>>> l2 = [1,2,3,4]
>>> l1 is l2      # Not bound to the same object 
False
>>> l1 == l2      # But again definitely equal in value
True

The 'is' operator is equivalent to comparing the id value.

It is expected that if two names are bound to the same object (var1 is var2) then the value comparison (==) will also be True - this is the case for almost all objects created by the standard library (a notable exception is the floating point value NaN, which never compares equal to anything, including itself). As demonstrated in Example 8, the reverse is not true - if two names are not bound to the same object, their values could still be equal.
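The difference between the two operators is perhaps easiest to see with a class of your own, where '==' is whatever the __eq__ method says it is, while 'is' only ever compares identity. The Money class below is a hypothetical example written for this sketch, in Python 3 syntax.

```python
class Money(object):
    """Hypothetical value class: equality is decided by __eq__."""

    def __init__(self, pence):
        self.pence = pence

    def __eq__(self, other):
        return isinstance(other, Money) and self.pence == other.pence

    def __ne__(self, other):      # needed on Python 2; harmless on Python 3
        return not self == other


first = Money(250)
second = Money(250)
alias = first

print(first == second)   # same value
print(first is second)   # but two different objects
print(alias is first)    # two names bound to one object

nan = float("nan")
print(nan is nan)        # one object ...
print(nan == nan)        # ... which does not compare equal to itself
```

The five lines printed are True, False, True, True and False.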

Sunday, 25 October 2015

Python gotchas # 1 : Indentation Tabs and Spaces

Python gotchas - Indentation

or how not to get caught out when you start using python

Indentation, Tabs and spaces

As you no doubt already know by now, Python uses the end of a line (in most cases) to delimit statements and uses indentation to identify blocks of code in a loop or function. Python allows you to use either Tabs or spaces to create your indentation, and even to mix both Tabs and spaces on the same line, but there is a hidden gotcha.

Tabs and spaces look alike in most editors - but if you indent one line with Tabs - and the next with spaces - Python will complain with an IndentationError or a TabError even though the lines will seem to be perfectly aligned in your editor.

This can be such a problem that the official style guide (PEP 8) recommends only using spaces for new code, and setting your editor to insert 4 spaces when the tab key is pressed; you should only use Tabs to stay consistent with existing code which you might be changing. Make sure you get your editor set up before you start coding - as finding and replacing those tab characters might be tricky when you have several hundred lines of code.
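If you do inherit a file with mixed indentation, a few lines of Python can at least tell you where the tabs are. The following is a minimal sketch (the function name is invented for this example, and the code uses Python 3 syntax) which reports every line whose leading whitespace contains a tab character.

```python
def find_tab_indented_lines(source):
    """Return the 1-based numbers of lines whose indentation contains a tab."""
    offending = []
    for number, line in enumerate(source.splitlines(), 1):
        indent = line[:len(line) - len(line.lstrip())]
        if "\t" in indent:
            offending.append(number)
    return offending


sample = "if flag:\n    print('spaces')\n\tprint('tab')\n"
print(find_tab_indented_lines(sample))
```

For the sample above this prints [3]. The standard library also includes the tabnanny module, which checks source files for ambiguous indentation.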

The use of indentation provides another thing to trip over if you are not wary. Since indentation determines the structure of the code, you can change the behaviour of your code by indenting or un-indenting a line :

Example 1

flag = False
if flag :
    print "Flag is True"
print "Continuing"

Example 2

flag = False
if flag :
    print "Flag is True"
    print "Continuing"

Example 1 and 2 above are identical, apart from one having the last line indented, but Example 1 generates a line of output, and Example 2 doesn't. You can easily see how this could drastically change how your program behaves, and there isn't an error generated, as both of these are completely valid.

Special care should be taken if you copy and paste code into help sites such as Stack Overflow, which will often remove formatting, making your code invalid and making it difficult for other people to help you.

Saturday, 24 October 2015

Python Weekly # 2 : 10 things to learn as a Python Developer

10 things to learn as a Python Developer

Whether you are contributing to Open source projects, or simply creating applications for your own use, there are a number of tools that you really should learn - these tools will make your life so much easier.

These are not recommendations for beginners - In fact I would strongly suggest as a beginner you don't worry about extra tools until you are confident with the basic language capabilities - including how to write Object Oriented Code, and how to separate your code into modules and Python Packages.

An IDE - A good IDE can shave significant time off your development cycle. There are many good IDEs available : IDLE, Eclipse, PyCharm (to name but a few). My personal favourite is PyCharm - I will explore why in another post. There are few projects which will insist on collaborators using a particular IDE, although of course some work environments may have standard IDEs which you need to use.
Choose a unit test framework or two and get familiar with them. The ability to design, construct and maintain unit tests is invaluable in any serious development project, as you will find your code quality increases (as does your confidence in the code), if you have a set of test cases that you can execute on a regular basis. unittest (Python 3.5), doctest (Python 3.5) and nose (if you are using unittest) are a few examples.
A mocking framework : Mocking is an incredibly powerful technique which allows you to replace a module with an artificial version, allowing you fine control over your testing, and allowing you to test your code in ways that might otherwise be impossible (or at least very difficult) to reproduce. A common mocking framework is mock (from Python 3.3 onwards this is part of the standard library)
Test Driven Development : A very powerful technique which effectively boils down to defining what your code will do before you start writing it. The difference is that instead of defining the code in a document, you define your code in terms of unit tests which will pass once your code is complete. During development regularly run your tests, so you can understand what is left to develop.
A Debugger : It is highly unlikely that your code will be developed bug free - so when you find a bug, knowing how to use the debugger will be significant. Certainly using a debugger well is far more efficient than splattering your code with multiple print statements. Most IDEs (see above) will come with an inbuilt debugger, and of course you can use pdb (Python 3.5). Remember once you find and fix the bug - write a test case for it, and include it in your standard test cases that you run.
Data Persistence : Any reasonably complex project will probably have a need to store and retrieve data between executions, and for that you will need to decide how that data is stored and retrieved. Options include : csv files (Python 3.5), json (Python 3.5), pickle (Python 3.5) or sqlite (Python 3.5). Personally I tend to use pickle for small low volume objects, and sqlite for larger volume data storage.
Documentation : Don't forget to document your code, and do it as you go. Make sure your code is readable, and that the comments and documentation strings explain what your code does and why (in preference to how it works). Also make sure you have good overall documentation on your complete application - README files, user guides etc. It is surprising how quickly you can forget a piece of code that you wrote a few weeks ago - and end up staring at it for hours trying to work out what it does.
Virtualisation : You will probably want to ensure that you isolate your code as you test it, to ensure that you don't destroy your working system. If you don't have a separate test machine, then this could be as simple as creating a clean working directory, but you might well need to use virtualenv or even virtualbox depending on what your code does and how far you want to isolate your testing.
Change Control : Even if you have no intention of publishing your code, making use of a good change control system is highly recommended - if nothing else it provides you a long term undo for all of your code. If you learn to use long lived branches, you can keep your next development independent of your working version. Examples include : git, mercurial, and subversion (also called svn). If you decide to contribute to Open source projects though, you will need to be familiar with the change control system that they use.
Packaging : Even if you are just creating code for your own use, if you want to ensure a clean installation every time, you probably want to look at packaging your code - either using the Python Packaging tools, or Debian Packaging tools (depending on how your code needs to be installed).
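To show how a unit test framework and a mocking framework fit together, here is a minimal sketch using unittest and unittest.mock (Python 3 syntax). The function describe_free_space is a hypothetical piece of application code invented for this example; the test replaces shutil.disk_usage with an artificial version so that the result does not depend on the machine running the test.

```python
import shutil
import unittest
from unittest import mock   # Python 3.3+; on Python 2 install the 'mock' package


def describe_free_space(path="/"):
    """Hypothetical function under test: report free disk space in MB."""
    usage = shutil.disk_usage(path)
    return "%d MB free" % (usage.free // (1024 * 1024))


class TestDescribeFreeSpace(unittest.TestCase):
    def test_formats_megabytes(self):
        # Inside the 'with' block shutil.disk_usage is a Mock object,
        # so the amount of free space is under the control of the test.
        with mock.patch("shutil.disk_usage") as fake_disk_usage:
            fake_disk_usage.return_value.free = 5 * 1024 * 1024
            self.assertEqual(describe_free_space("/data"), "5 MB free")
            fake_disk_usage.assert_called_once_with("/data")
```

The same test would be very hard to write reliably without mocking, since the real amount of free space changes from moment to moment.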

Saturday, 17 October 2015

Python Weekly # 1 : Notes on developing an import hook

Notes on developing an import hook

Before we start I should give a warning : this article covers import hooks in Python 2.7. The hook mechanism changed in Python 3.4 (see PEP 451), although the mechanism in Python 2.7 is still supported.

I am a big fan of using JSON to allow configuration of my application, especially where the configuration is more towards the back end functionality. Many of my applications have variations of either Class or object factories based on JSON data - using JSON dictionaries as templates for the formation of individual classes or instances, and I soon realised that a more "standard" way of approaching JSON would be ideal.

Using the Python import hook mechanism (See PEP 0302), I set out to develop a way of importing a JSON file, and automatically generating classes etc from the data within the JSON - an implementation of data as code.

The result is the importjson library (available as source on GitHub, and as an installable package from PyPI - the Python Package Index).

This article documents some of what I learned during the development process.

Python exposes two types of import hooks - sys.path_hooks and sys.meta_path :

sys.path_hooks: These hooks are best used when there will be a specific entry on sys.path which only needs the special importer - for instance the zipimport module uses this mechanism.
sys.meta_path: These hooks are appropriate when a special sys.path entry will not exist - i.e. where the things to be imported could exist anywhere.

It was clear that sys.meta_path is ideal for my purposes, since in my applications I often co-locate JSON files with my python files - and I am sure this is the case with many other developers. Also using the sys.meta_path mechanism would still work if the JSON files are kept separate from the python files.

The Importer Protocol is well documented in PEP 0302, so I will not cover the details here, but I will cover a few important notes for anyone trying something similar - things I tripped over.

Installation

Your importer not only needs to implement the Importer Protocol, but it will also need to install itself onto either sys.path_hooks, or sys.meta_path.

An entry in sys.meta_path should be an object instance which implements at least the find_module method from the Importer Protocol, while an entry in sys.path_hooks will need to be a callable which when passed a path (from sys.path), returns an object which implements at least the find_module method from the Importer Protocol (the callable can clearly be a class which accepts the sys.path entry as an argument to the constructor).
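The sys.path_hooks case can be sketched as follows. This is a minimal illustration and not the importjson code : the class name JsonDirectoryImporter and the rule "only accept directories which contain .json files" are assumptions made for the example. The class itself is the callable; PEP 0302 says the callable should raise ImportError for any path entry it cannot handle.

```python
import glob
import os


class JsonDirectoryImporter(object):
    """Hypothetical sys.path_hooks entry: called with one sys.path entry."""

    def __init__(self, path_entry):
        # PEP 302: raise ImportError to decline a path entry, so that the
        # next hook (or the default machinery) gets a chance to handle it.
        if not glob.glob(os.path.join(path_entry, "*.json")):
            raise ImportError("no JSON files in %r" % (path_entry,))
        self.path_entry = path_entry

    def find_module(self, fullname, path=None):
        """Return a loader if <name>.json exists in this entry, else None."""
        filename = fullname.rpartition(".")[2] + ".json"
        if os.path.isfile(os.path.join(self.path_entry, filename)):
            return self      # load_module is omitted from this sketch
        return None
```

Under Python 2.7 the class would be installed with sys.path_hooks.append(JsonDirectoryImporter), whereas a sys.meta_path importer is installed by appending an instance rather than the class.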

Reuse and caching

Regardless of whether the Importer is installed on sys.path_hooks or sys.meta_path the instance will be reused, so care should be taken when deciding which data should be stored on a per instance basis.

Due to the way that Python caches sys.path_hooks entries (sys.path_importer_cache), once an Importer successfully finds one module within a sys.path entry the same Importer will be used to load all other modules in the same sys.path entry. This caching makes a sys.path_hooks type importer unsuitable for situations where many types of importable file might exist within a given directory.

Translation of module names

The Importer Protocol consists of two methods which have very simple signatures :

finder.find_module(fullname, path=None)

finder.load_module(fullname)

The details of what these methods need to do are documented in PEP 0302, so I won't cover them again here, but I will note :

Both methods are passed the fullname of the module - i.e. the fully dotted name of the module being loaded, so both methods need to be able to translate the fullname of the module into a file name which can be read and converted to a module, and it makes sense for that translation to be consistent (I cannot think of a situation where you want inconsistency here).

In my module I solved this by the find_module method storing the translation of fullname to file name into a class wide dictionary, and the load_module method uses the same dictionary to identify which file to open.
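The class-wide dictionary approach can be sketched like this. It is a simplified illustration written for this article rather than the actual importjson code : the class name, the single search_dir argument and the decision to copy the top level JSON keys straight into the module namespace are all assumptions made to keep the example short.

```python
import json
import os
import sys
import types


class JsonImporter(object):
    """Hypothetical PEP 302 style finder and loader for .json files."""

    _filenames = {}     # class-wide: fullname -> file name, shared by all instances

    def __init__(self, search_dir):
        self.search_dir = search_dir

    def find_module(self, fullname, path=None):
        relative = fullname.replace(".", os.sep) + ".json"
        candidate = os.path.join(self.search_dir, relative)
        if not os.path.isfile(candidate):
            return None                       # let other importers try
        JsonImporter._filenames[fullname] = candidate
        return self                           # this object is also the loader

    def load_module(self, fullname):
        if fullname in sys.modules:           # PEP 302: reuse an existing module
            return sys.modules[fullname]
        filename = JsonImporter._filenames[fullname]
        with open(filename) as handle:
            data = json.load(handle)          # assumes a JSON object at top level
        module = types.ModuleType(fullname)
        module.__file__ = filename
        module.__loader__ = self
        sys.modules[fullname] = module        # register before populating
        module.__dict__.update(data)
        return module
```

Because find_module records the translation and load_module only reads it, the two methods cannot disagree about which file belongs to which module name. The methods can be called directly, as the import statement would call them under Python 2.7, which is also a convenient way to test an importer before installing it.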

How you do the translation depends in part on whether you have a sys.path_hooks or sys.meta_path installed Importer :

With a sys.path_hooks importer, the importer instance is initialized with an entry from sys.path (i.e. a file path), and the find_module method is then passed the fully qualified name and the path of the most immediate parent module which applies (or None if the module being imported is a top-level module). Using either the path from the initializer, or the path given as the find_module argument, you can identify the path to the specific module.
With a sys.meta_path importer, the find_module method is passed a fully qualified module name, and the path of the most immediate parent module (or None if the module being imported is a top-level module). The importer should use the path to the parent if provided, but the importer will need to determine how and where to look for top level modules. My JSON importer does its own traversal of sys.path to find where to search, but there are other options.
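A traversal of this kind might look like the following minimal sketch. The function name locate_json is invented for this example and the code is not taken from importjson; it assumes that path, when given, is a list of directories (the parent package's search path).

```python
import os
import sys


def locate_json(fullname, path=None):
    """Return the file name of the JSON file for `fullname`, or None.

    `path` is the parent package's search path, or None for a top-level
    module, in which case sys.path is searched instead.
    """
    filename = fullname.rpartition(".")[2] + ".json"
    for directory in (path if path is not None else sys.path):
        # An empty string on sys.path means the current directory
        candidate = os.path.join(directory or os.curdir, filename)
        if os.path.isfile(candidate):
            return candidate
    return None
```

A find_module method could call this helper and return a loader only when the result is not None.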

And finally :

During the development cycle beware when reloading your importer module, as this could well result in multiple entries in sys.path_hooks or sys.meta_path, which could lead to inconsistent test results - take care to remove old entries before reloading.

If during development your importer does go completely wrong, and breaks your ability to import other modules, you have a few different options :

Remove the importer entry from sys.meta_path
Remove the importer entry from sys.path_hooks and from sys.path_importer_cache
If all else fails, exit the interpreter and start again.
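The clean-up steps above can be wrapped in a pair of small helpers, sketched below under the assumption that your importer is a class. The function names install and uninstall are invented for this example. Entries are matched by class name rather than by class object, so that instances left behind from before a reload are removed too.

```python
import sys


def uninstall(importer_class):
    """Remove every trace of `importer_class` from the import machinery."""
    # Compare by class name so that instances created before a reload(),
    # which belong to the old class object, are removed as well.
    name = importer_class.__name__
    sys.meta_path[:] = [entry for entry in sys.meta_path
                        if type(entry).__name__ != name]
    sys.path_hooks[:] = [hook for hook in sys.path_hooks
                         if getattr(hook, "__name__", None) != name]
    for path_entry, importer in list(sys.path_importer_cache.items()):
        if type(importer).__name__ == name:
            del sys.path_importer_cache[path_entry]


def install(importer):
    """Put `importer` on sys.meta_path, replacing any earlier instance."""
    uninstall(type(importer))
    sys.meta_path.append(importer)
```

Calling install each time the importer module is loaded means a reload replaces the old entry instead of adding a second one.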

Writing your own import hook may seem scary at first, but in fact it is relatively simple - the key thing is coming up with the idea in the first place.