
Exercise

  1. Take this short interactive regex tutorial.

The tutorial is useful for everyone. To understand the material better, work through the problems below after finishing the tutorial.

  1. Find the number of words (in /usr/share/dict/words) that contain at
    least three as and don’t have a 's ending. What are the three
    most common last two letters of those words? sed's y command, or
    the tr program, may help you with case insensitivity. How many
    of those two-letter combinations are there? And for a challenge:
    which combinations do not occur?
    $ cat /usr/share/dict/words | \
    grep -E ".*([Aa].*){3,}[^Ss]$" | \
    grep -o -e '.\{2\}$' | \
    sort | uniq -c | sort -nr | head -n3

Exercise

  1. Complete vimtutor. Note: it looks best in an
    80x24 (80 columns by 24 lines)
    terminal window.

    $ vimtutor

    Really fantastic learning material. Also remember the note inside the tutor file: you should be learning by doing, not by memorization.

  2. Download our basic vimrc and save it to ~/.vimrc. Read
    through the well-commented file (using Vim!), and observe how Vim looks and
    behaves slightly differently with the new config.


Notes

  1. Check the major differences between Python 2 and Python 3 in What's New in Python 3.0

    • Views And Iterators Instead Of Lists
    • PEP 238 Changing the behavior of division
    • All text is Unicode
  2. REPL (read-eval-print loop), doctests. Write tests first, as TDD suggests.

  3. "dunder methods": double-underscore methods, e.g. __getitem__.

  4. Use of collections.namedtuple to construct a simple class to represent individual cards.

    Card = collections.namedtuple('Card', ['rank', 'suit'])

Python Data Models

  1. len(some_object) rather than some_object.len().
  2. Implement __getitem__ to make an object indexable and iterable, and therefore usable with many standard library functions!
  3. The CPython interpreter calls the __len__ method for user-defined classes, but takes a shortcut for built-in objects like list and str: it reads the ob_size field of the PyVarObject C struct.
  4. Avoid creating arbitrary, custom attributes with the __foo__ syntax because such names
    may acquire special meanings in the future, even if they are unused today.
  5. If we did not implement __repr__, vector instances would
    be shown in the console like <Vector object at 0x10e100070>. The string returned by __repr__ should be unambiguous and, if possible, match the
    source code necessary to recreate the object being represented.
  6. __str__ should return a string suitable for display
    to end-users. Check difference between __str__ and __repr__.
  7. By default, instances of user-defined classes are considered truthy, unless either
    __bool__ or __len__ is implemented.
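
The points above can be illustrated with a minimal sketch in the spirit of the Card namedtuple mentioned earlier (the class below is illustrative): implementing just __len__ and __getitem__ is enough to support len(), indexing, slicing, iteration and the in operator.

```python
import collections

Card = collections.namedtuple('Card', ['rank', 'suit'])

class FrenchDeck:
    ranks = [str(n) for n in range(2, 11)] + list('JQKA')
    suits = 'spades diamonds clubs hearts'.split()

    def __init__(self):
        self._cards = [Card(rank, suit) for suit in self.suits
                                        for rank in self.ranks]

    def __len__(self):
        # len(deck) delegates here
        return len(self._cards)

    def __getitem__(self, position):
        # indexing, slicing, iteration and `in` all work through this
        return self._cards[position]

deck = FrenchDeck()
print(len(deck))   # 52
print(deck[0])     # Card(rank='2', suit='spades')
```

Because __getitem__ delegates to a list, the deck is also sliceable: deck[:3] returns the first three cards.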

Data structures

Array of sequences

  1. Strings, lists, byte sequences, arrays, XML elements and database results share a rich set of common operations including iteration, slicing, sorting and concatenation.

  2. Container sequences (list, tuple and collections.deque) hold references to the objects they contain, which may be of any type, while flat sequences (str, bytes, bytearray, memoryview and array.array) physically store the value of each item within their own memory space, and not as distinct objects.

  3. Mutable sequences:
    list, bytearray, array.array, collections.deque and memoryview; Immutable sequences:
    tuple, str and bytes

  4. List comprehension = listcomp, generator expression = genexp.

  5. Use a listcomp only when building a new list; otherwise use a for loop, which is more readable and less prone to unnoticed side effects.

  6. In Python code, line breaks are ignored inside pairs of [], {} or ().
    So you can build multi-line lists, listcomps, genexps, dictionaries etc.
    without using the ugly \ line continuation escape.

  7. Use tuples as records, where the position is vital to the meaning of the record.

    traveler_ids = [('USA', '31195855'), ('BRA', 'CE342567'),
    ... ('ESP', 'XDA205856')]
  8. Unpacking. Tuple unpacking is widely used in Python. But Python also provides Extended Iterable Unpacking, where we use * to grab excess items.

    >>> a, b, *rest = range(5)
    >>> a, b, rest
    (0, 1, [2, 3, 4])
    >>> a, b, *rest = range(3)
    >>> a, b, rest
    (0, 1, [2])
    >>> a, b, *rest = range(2)
    >>> a, b, rest
    (0, 1, [])
  9. Nested tuple unpacking: The tuple to receive an expression to unpack can have nested tuples, like (a, b, (c, d)) and Python will do the right thing if the expression matches the nesting structure.

  10. Named tuples: The collections.namedtuple function is a factory that produces subclasses of tuple
    enhanced with field names and a class name — which helps debugging

    >>> from collections import namedtuple
    >>> City = namedtuple('City', 'name country population coordinates')
    >>> tokyo = City('Tokyo', 'JP', 36.933, (35.689722, 139.691667))
    >>> tokyo
    City(name='Tokyo', country='JP', population=36.933, coordinates=(35.689722, 139.691667))

    Other attributes of namedtuple: _fields, _make, _asdict

    >>> City._fields
    ('name', 'country', 'population', 'coordinates')
    >>> LatLong = namedtuple('LatLong', 'lat long')
    >>> delhi_data = ('Delhi NCR', 'IN', 21.935, LatLong(28.613889, 77.208889))
    >>> delhi = City._make(delhi_data)
    >>> delhi._asdict()
    OrderedDict([('name', 'Delhi NCR'), ('country', 'IN'), ('population',
    21.935), ('coordinates', LatLong(lat=28.613889, long=77.208889))])
  11. Slice Object: slice(start, stop, step) creates a slice object that can be used to slice sequences; most importantly, it lets us give a slice a name.
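
For example, a named slice can document a fixed-width record layout (the field positions below are made up for illustration):

```python
# hypothetical fixed-width record: name in columns 0-9, price in columns 10-15
record = 'widget'.ljust(10) + '12.50'

NAME = slice(0, 10)    # slice(start, stop, step); step defaults to None
PRICE = slice(10, 16)

print(record[NAME].strip())   # widget
print(record[PRICE])          # 12.50
```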

  12. Assigning to a slice replaces the corresponding section of the sequence.

    >>> l = list(range(10))
    >>> l
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
    >>> l[2:5] = [20, 30]
    >>> l
    [0, 1, 20, 30, 5, 6, 7, 8, 9]
    >>> del l[5:7]
    >>> l
    [0, 1, 20, 30, 5, 8, 9]
    >>> l[3::2] = [11, 22]
  13. Beware of expressions like a * n when a is a sequence containing mutable items.

    >>> l = [[1], [2], [3]]
    >>> b = l * 5
    >>> b[2][0] = 10
    >>> b
    [[1], [2], [10], [1], [2], [10], [1], [2], [10], [1], [2], [10], [1], [2], [10]]
  14. Something weird happens with a list inside a tuple; see the example below:

    >>> t = (1, 2, [30, 40])
    >>> t[2] += [50, 60]
    Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    TypeError: 'tuple' object does not support item assignment
    >>> t
    (1, 2, [30, 40, 50, 60])

    Takeaway from the example:

  • Putting mutable items in tuples is not a good idea.
  • Augmented assignment is not an atomic operation — we just saw it throwing an exception after doing part of its job.
  1. Python API convention: functions or methods that change an object in-place should return None to make it clear to the caller that the object itself was changed, and no new object was created.

  2. Because this convention returns None, Python does not naturally support a fluent interface. A fluent interface is normally implemented by using method chaining to implement method cascading (in languages that do not natively support cascading), concretely by having each method return this (self).

  3. bisect.insort: inserts item into seq so as to keep seq in ascending order.
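
A small sketch of bisect.insort keeping a list sorted as items arrive (the input values are arbitrary):

```python
import bisect

my_list = []
for new_item in [10, 4, 7, 4, 1]:
    bisect.insort(my_list, new_item)   # binary search + insert, keeps order
    print(my_list)

# final state: [1, 4, 4, 7, 10]
```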

  4. array.array is better than list when all items in the sequence are numbers. It is backed by an actual C array and supports all mutable sequence operations (pop, insert, extend), plus additional write and read methods (fromfile, tofile). Notably, fromfile can read 10 million floating-point numbers about 60 times faster than parsing them from a text file, since it reads and writes a binary file (which is also 50-60% smaller than the text file).
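
A sketch of the compact binary round trip; tobytes/frombytes are used here instead of tofile/fromfile so the example needs no file:

```python
from array import array

floats = array('d', (x / 10 for x in range(5)))   # typecode 'd' = C double
print(floats)    # array('d', [0.0, 0.1, 0.2, 0.3, 0.4])

raw = floats.tobytes()    # the same compact binary form tofile() writes
print(len(raw))           # 40 bytes: five 8-byte doubles

clone = array('d')
clone.frombytes(raw)      # counterpart of fromfile()
print(clone == floats)    # True
```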

  5. Memory views: the built-in memoryview is a shared-memory sequence type that lets you handle
    slices of arrays without copying bytes. It was inspired by the NumPy library and is useful when dealing with large datasets.

  6. deque: double-ended queue, optimized for adding and removing elements at both ends. The queue module provides the synchronized (i.e. thread-safe) classes Queue, LifoQueue and PriorityQueue, which are used for safe communication between threads.

  7. The multiprocessing package has its own queue, designed for inter-process communication; asyncio also implements similar queues, adapted for managing tasks in asynchronous programming.
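
A short deque sketch showing the cheap operations at both ends:

```python
from collections import deque

dq = deque(range(5), maxlen=5)   # bounded double-ended queue
dq.appendleft(-1)                # O(1) at the left; rightmost item is evicted
print(dq)                        # deque([-1, 0, 1, 2, 3], maxlen=5)
dq.rotate(2)                     # move the last two items to the front
print(dq)                        # deque([2, 3, -1, 0, 1], maxlen=5)
```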

Dictionaries and Sets

  1. Dictionaries are everywhere in Python. The built-in functions live in __builtins__.__dict__.
  2. Dictionary keys must be hashable, i.e. implement the __hash__ method (and __eq__). The atomic immutable types (str, bytes, numeric types) are all hashable. A frozenset (a built-in set type that cannot be changed after creation) is always hashable, because its elements must be hashable by definition. A tuple is hashable only if all its items are hashable.
  3. dict comprehension (dictcomp): builds a dict instance by producing key:value pairs from any iterable.
  4. Every Pythonista knows that d.get(k, default) is an alternative to d[k] whenever a default value is more convenient than handling KeyError. However, I didn't know that before reading this paragraph…
  5. setdefault:
    my_dict.setdefault(key, []).append(new_value)
    # is equivalent to
    if key not in my_dict:
        my_dict[key] = []
    my_dict[key].append(new_value)
  6. Use defaultdict, or another mapping implementing __missing__, to properly handle missing keys on lookup.
  7. A search like k in my_dict.keys() is efficient in Python 3 even for very large mappings because dict.keys() returns a view, which is similar to a set, and containment checks in sets are as fast as in dicts. Check this for explanations about dictionary view.
  8. Other mapping types: collections.OrderedDict maintains keys in insertion order, allowing iteration over items in a predictable order; collections.ChainMap holds a list of mappings which can be searched as one: the lookup is performed on
    each mapping in order, and succeeds if the key is found in any of them; collections.Counter is a mapping that holds an integer count for each key; collections.UserDict is a pure Python implementation of a mapping that works like a standard dict, usually designed to be subclassed.
  9. We should subclass UserDict rather than the built-in dict, because the built-in's methods take implementation shortcuts that may cause problems (for example, dict.get does not call __getitem__, so it bypasses __missing__). Here is a comparison of designing a new dictionary class with UserDict versus dict:
    class StrKeyDict0(dict):
        def __missing__(self, key):
            if isinstance(key, str):
                raise KeyError(key)
            return self[str(key)]
        def get(self, key, default=None):
            try:
                return self[key]
            except KeyError:
                return default
        def __contains__(self, key):
            return key in self.keys() or str(key) in self.keys()
    import collections

    class StrKeyDict(collections.UserDict):
        def __missing__(self, key):
            if isinstance(key, str):
                raise KeyError(key)
            return self[str(key)]
        def __contains__(self, key):
            return str(key) in self.data
        def __setitem__(self, key, item):
            self.data[str(key)] = item
  10. types.MappingProxyType generates a read-only but dynamic view of a dictionary, which prevents incautious updates to the dictionary.
  11. The set class implements infix operators such as union (|), intersection (&) and difference (-), which can replace explicit loops and reduce running time. Here is an example counting the occurrences of needles in haystack:
    found = len(needles & haystack)
  12. Set comprehensions are similar to list/dict comprehensions.
  13. The hash tables used by dict and set are powerful and in theory achieve constant-time key search and retrieval. In practice the time may not always be constant: for example, when memory cannot fit the whole dataset, fetching time may grow with the size of the data.
  14. Implementation of the hash table used by Python: a sparse array of cells/buckets, where each cell holds references to the key and the value. Python tries to keep at least 1/3 of the cells empty.
  15. Due to the hash table implementation, adding items to a dict/set may change the order of existing keys. So DO NOT modify the contents of a dict or set while iterating through it.
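
As a quick illustration of the defaultdict point above, missing keys are given a freshly created default value on lookup (the word list here is arbitrary):

```python
from collections import defaultdict

# group words by first letter; a missing key gets a fresh list automatically
index = defaultdict(list)
for word in ['apple', 'avocado', 'banana', 'cherry']:
    index[word[0]].append(word)

print(dict(index))   # {'a': ['apple', 'avocado'], 'b': ['banana'], 'c': ['cherry']}
```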

Text and bytes

  1. str is a sequence of characters, but "character" has more than one definition, which may lead to problems if not handled properly.

  2. The actual bytes that represent a character depend on the encoding in use. Going from a character's code point to its byte representation is encoding; the reverse is decoding. The byte representation is used for storage and transmission, while str is for human readability.

  3. UnicodeEncodeError usually means the source string contains characters that cannot be mapped to a byte sequence by the specified codec.

    >>> city = 'São Paulo'
    >>> city.encode('utf_8')
    b'S\xc3\xa3o Paulo'
    >>> city.encode('utf_16')
    b'\xff\xfeS\x00\xe3\x00o\x00 \x00P\x00a\x00u\x00l\x00o\x00'
    >>> city.encode('iso8859_1')
    b'S\xe3o Paulo'
    >>> city.encode('cp437')
    Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    File "/.../lib/python3.4/encodings/cp437.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
    UnicodeEncodeError: 'charmap' codec can't encode character '\xe3' in
    position 1: character maps to <undefined>
    >>> city.encode('cp437', errors='ignore')
    b'So Paulo'
    >>> city.encode('cp437', errors='replace')
    b'S?o Paulo'
    >>> city.encode('cp437', errors='xmlcharrefreplace')
    b'S&#227;o Paulo'

    UnicodeDecodeError occurs when a byte sequence cannot be decoded by the specified codec.

    >>> octets = b'Montr\xe9al'
    >>> octets.decode('cp1252')
    'Montréal'
    >>> octets.decode('iso8859_7')
    'Montrιal'
    >>> octets.decode('koi8_r')
    'MontrИal'
    >>> octets.decode('utf_8')
    Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 5:
    invalid continuation byte
    >>> octets.decode('utf_8', errors='replace')
    'Montr�al'

    SyntaxError occurs when a source Python file contains bytes that are not valid UTF-8 (the default encoding for Python 3 source files).

  4. When you are not sure of a byte sequence's encoding, use chardet to detect it.

  5. The Unicode sandwich is the best practice for text processing: (1) decode bytes to str as early as possible on input; (2) do all processing on str; (3) encode str back to bytes as late as possible on output.

  6. Garbled characters appear when you read a file with the wrong codec; typically, Windows' default "Windows 1252" is not compatible with some characters in UTF-8 text.

  7. Be aware that the default codecs differ between Unix and Windows; check locale.getpreferredencoding() for the default encoding.

  8. The os and re modules accept both str and bytes as arguments, with correspondingly different handling strategies. Be careful about the type before passing values to functions in os and re.
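
A minimal sketch of the Unicode sandwich; the bytes value here stands in for data read from a file or socket:

```python
raw = 'café'.encode('utf_8')     # bytes arriving from a file or socket

text = raw.decode('utf_8')       # (1) decode bytes -> str at the boundary
processed = text.upper()         # (2) all processing happens on str
out = processed.encode('utf_8')  # (3) encode str -> bytes on the way out

print(processed)   # CAFÉ
print(out)         # b'CAF\xc3\x89'
```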

Functions as objects

Functions

  1. Functions in Python are first-class objects (like integers, strings and dictionaries). Programming language theorists define a first-class object as a program entity that can be:
  • created at runtime;
  • assigned to a variable or element in a data structure;
  • passed as an argument to a function;
  • returned as the result of a function.
  1. Higher-order functions: functions that take a function as an argument or return a function as their result, e.g. sorted.

    # sorted takes a function len as argument
    >>> fruits = ['strawberry', 'fig', 'apple', 'cherry', 'raspberry', 'banana']
    >>> sorted(fruits, key=len)
    ['fig', 'apple', 'cherry', 'banana', 'raspberry', 'strawberry']
  2. In the functional programming paradigm, some of the best known higher-order functions are map, filter, reduce and apply. apply was removed in Python 3, and for the other functions there are better alternatives for most of their use cases.

  3. map and filter can be replaced by listcomps and genexps, which are more readable and sometimes faster according to the link (which also has good explanations of the differences between map and listcomp).

  4. The best use of anonymous functions is in the context of an argument list. Otherwise it is better to use a normal function definition. Check here for refactoring tips for bad lambda functions in Python.

  5. Python 3 provides very flexible argument passing to functions; check here for more information. The official explanation of keyword-only arguments is in PEP 3102.

    # understand more about how to write functions with variable positional and keyword arguments
    >>> def f(a, *pargs, **kargs):
    ...     print(a, pargs, kargs)
    ...
    >>> f(1, 2, 3, x=4, y=5)
    1 (2, 3) {'x': 4, 'y': 5} # the arguments are parsed and assigned to the different parameters in the function signature
    >>> f(1, 2, 3, x=4, x=5)
    File "<stdin>", line 1
    SyntaxError: keyword argument repeated # duplicated keyword arguments are rejected
    >>> f(1, 2, 3, x=4, y=5, 10)
    File "<stdin>", line 1
    SyntaxError: positional argument follows keyword argument
    >>> def f(a, **kargs, *pargs):
    File "<stdin>", line 1
        def f(a, **kargs, *pargs):
                          ^
    SyntaxError: invalid syntax # *args cannot appear after **kwargs in a signature
  6. Function annotations: Python 3 provides syntax to attach metadata to the parameters of a function declaration
    and its return value. Here is an example; the only difference is the signature in the declaration.

    def clip(text, max_len=80): # without annotations
    ...
    def clip(text:str, max_len:'int > 0'=80) -> str: # with annotations
    ...
    >>> clip.__annotations__
    {'text': <class 'str'>, 'max_len': 'int > 0', 'return': <class 'str'>}

Note: The only thing Python does with annotations is to store them in the __annotations__ attribute of the function. Nothing else: no checks, enforcement, validation, or any other action is performed. They are just metadata that may be used by tools, such as IDEs, frameworks and decorators.

  1. The operator module implements many common functions needed in functional programming, e.g. mul == lambda a, b: a*b. itemgetter builds a function that retrieves an item by index from anything implementing __getitem__, and attrgetter builds a function that retrieves attributes by name. Some of the functions supported by operator:
    >>> [name for name in dir(operator) if not name.startswith('_')]
    ['abs', 'add', 'and_', 'attrgetter', 'concat', 'contains',
    'countOf', 'delitem', 'eq', 'floordiv', 'ge', 'getitem', 'gt',
    'iadd', 'iand', 'iconcat', 'ifloordiv', 'ilshift', 'imod', 'imul', 'index', 'indexOf', 'inv', 'invert', 'ior', 'ipow', 'irshift', 'is_', 'is_not', 'isub', 'itemgetter', 'itruediv', 'ixor', 'le', 'length_hint', 'lshift', 'lt', 'methodcaller', 'mod', 'mul', 'ne', 'neg', 'not_', 'or_', 'pos', 'pow', 'rshift', 'setitem', 'sub', 'truediv', 'truth', 'xor']
    Most of the 52 names above are self-evident. The group of names prefixed with i and
    the name of another operator — e.g. iadd, iand etc. — correspond to the augmented
    assignment operators — e.g. +=, &= etc.

Also check methodcaller in operator, which builds a function that calls a method by name on its operand.
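
A brief methodcaller sketch (the strings are arbitrary):

```python
from operator import methodcaller

s = 'The time has come'

upcase = methodcaller('upper')    # calls .upper() on its argument
print(upcase(s))                  # THE TIME HAS COME

hyphenate = methodcaller('replace', ' ', '-')   # extra arguments are bound
print(hyphenate(s))               # The-time-has-come
```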

  1. functools.partial creates partial functions in which some of the arguments of the original function are frozen. Here is an example:
    >>> from operator import mul
    >>> from functools import partial
    >>> triple = partial(mul, 3)
    >>> triple(7)
    21
    >>> list(map(triple, range(1, 10)))
    [3, 6, 9, 12, 15, 18, 21, 24, 27]
    partial takes a callable as its first argument, followed by an arbitrary number of positional
    and keyword arguments to bind. As a side note, partialmethod (introduced in Python 3.4) does the same as partial, but for methods; refer to this.

Decorators and closures

  1. The first crucial fact about decorators is that they have the power to replace the decorated function with a different one. The second crucial fact is that they are executed immediately when a module is loaded.
    Check the following decorators and prints to understand it better.

    # registration.py
    registry = []

    def register(func):
        print('running register(%s)' % func)
        registry.append(func)
        return func

    @register
    def f1():
        print('running f1()')

    @register
    def f2():
        print('running f2()')

    def f3():
        print('running f3()')

    def main():
        print('running main()')
        print('registry ->', registry)
        f1()
        f2()
        f3()

    if __name__ == '__main__':
        main()

    ## running this as a python file
    $ python3 registration.py
    running register(<function f1 at 0x100631bf8>)
    running register(<function f2 at 0x100631c80>)
    running main()
    registry -> [<function f1 at 0x100631bf8>, <function f2 at 0x100631c80>]
    running f1()
    running f2()
    running f3()
  2. Closure: as in other programming languages, a closure is a function that retains access to the free variables of its enclosing scope. Consider the following example:

    def make_averager():
        series = []
        def averager(new_value):
            series.append(new_value)
            total = sum(series)
            return total/len(series)
        return averager

    # load the module and run
    >>> avg = make_averager()
    >>> avg(10)
    10.0
    >>> avg(11)
    10.5
    >>> avg(12)
    11.0

    Here series is a free variable (a variable that is not bound in the local scope). Although it is not found in the local scope, series is bound and kept in the __closure__ attribute of the returned function avg.
    Here is how we can find the free variable (continuing from the last code block).

    >>> avg.__code__.co_varnames
    ('new_value', 'total')
    >>> avg.__code__.co_freevars
    ('series',)
    >>> avg.__closure__
    (<cell at 0x107a44f78: list object at 0x107a91a48>,)
    >>> avg.__closure__[0].cell_contents
    [10, 11, 12]

    In short, we only need to think about these "external variables" when we define a function nested inside another function.

  3. nonlocal lets an inner function rebind an immutable free variable without it becoming a local variable, even though there is an assignment in the local scope. Here is a function that gives the same output as the previous averager:

    def make_averager():
        count = 0
        total = 0
        def averager(new_value):
            nonlocal count, total  # without this line an assignment error occurs, because count would become a local variable
            count += 1
            total += new_value
            return total / count
        return averager
  4. A drawback of a simple decorator is that it masks the __name__ and __doc__ of the decorated function. We can use functools.wraps to copy that information to the returned wrapper function.
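
A sketch of the functools.wraps fix (clock and snooze are illustrative names, not from the text):

```python
import functools
import time

def clock(func):
    @functools.wraps(func)    # copy __name__, __doc__, etc. onto the wrapper
    def wrapper(*args, **kwargs):
        t0 = time.perf_counter()
        result = func(*args, **kwargs)
        print('%s took %.6fs' % (func.__name__, time.perf_counter() - t0))
        return result
    return wrapper

@clock
def snooze(seconds):
    """Sleep for the given number of seconds."""
    time.sleep(seconds)

snooze(0.01)
print(snooze.__name__)   # snooze (would be 'wrapper' without @wraps)
print(snooze.__doc__)    # Sleep for the given number of seconds.
```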

  5. Check singledispatch decorator at here.

  1. Stacked decorators are applied from inner to outer:

    @d1
    @d2
    def f():
        ...

    # is the same as
    f = d1(d2(f))
  2. A decorator factory lets the decorator accept its own arguments; note that the factory returns the decorator, rather than the inner function defined inside the decorator. Sample:

    registry = set()

    def register(active=True):
        def decorate(func):
            print('running register(active=%s)->decorate(%s)'
                  % (active, func))
            if active:
                registry.add(func)
            else:
                registry.discard(func)
            return func
        return decorate

    @register(active=False)
    def f1():
        print('running f1()')

    @register()  # note the parentheses: call register() to get the real decorator!
    def f2():
        print('running f2()')
  3. Check Python Decorator Library for more practical examples of decorators.

Object Oriented Idioms

Object references, mutability, recycling

  1. Python variables are references to objects.

    >>> a = [1, 2, 3]
    >>> b = a
    >>> a.append(4)
    >>> b
    [1, 2, 3, 4]
  2. Use is to check whether two variables refer to the same object, and == to check whether two objects' values are equal. is is faster than == because is cannot be overloaded, so Python does not have to find and invoke special methods to evaluate it. Note that == is syntactic sugar for the __eq__ method.
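
A quick sketch of identity versus equality:

```python
a = [1, 2, 3]
b = a          # b is an alias for the same object
c = [1, 2, 3]  # a distinct object with equal contents

print(a is b)  # True  - same identity
print(a is c)  # False - different objects
print(a == c)  # True  - equal values (calls a.__eq__(c))
```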

  3. Copies are shallow by default: "shallow" means the outermost container is copied, but the copy is filled with references to the same inner elements. Here is an example:

    >>> l1 = [3, [55, 44], (7, 8, 9)]
    >>> l2 = list(l1)
    >>> l2
    [3, [55, 44], (7, 8, 9)]
    >>> l2 == l1
    True
    >>> l2 is l1
    False
    >>> l2[1].append(100)
    >>> l1
    [3, [55, 44, 100], (7, 8, 9)]
    >>> l2
    [3, [55, 44, 100], (7, 8, 9)]
  4. Use deepcopy to make copies that share no references with the source object. We can also customize copy and deepcopy behavior by implementing the __copy__() and __deepcopy__() methods, as described in the doc.
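
A deepcopy counterpart to the shallow-copy example (same list layout as above):

```python
import copy

l1 = [3, [55, 44], (7, 8, 9)]
l2 = copy.deepcopy(l1)    # inner mutable objects are copied too

l2[1].append(100)
print(l1)   # [3, [55, 44], (7, 8, 9)]      - unaffected
print(l2)   # [3, [55, 44, 100], (7, 8, 9)]
```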

  5. The parameter passing in Python is call by sharing: each formal parameter of the function gets a copy of each reference in the arguments. The result of this scheme is that a function may change any mutable object passed as a parameter, but it cannot change the identity of those objects, i.e. it cannot replace altogether an object with another. Here is an example

    >>> def f(a, b):
    ...     a += b
    ...     return a
    ...
    >>> x = 1
    >>> y = 2
    >>> f(x, y)
    3
    # x and y do not change: ints are immutable, so the function cannot replace x with a new object
    >>> x, y
    (1, 2)
    >>> a = [1, 2]
    >>> b = [3, 4]
    >>> f(a, b)
    [1, 2, 3, 4]
    # the identity of a is unchanged, but its value has been changed in place
    >>> a, b
    ([1, 2, 3, 4], [3, 4])
    >>> t = (10, 20)
    >>> u = (30, 40)
    >>> f(t, u)
    (10, 20, 30, 40)
    >>> t, u
    ((10, 20), (30, 40))
    # tuples are immutable, so the behavior is the same as with the int objects above
  6. Avoid using mutable objects (list, dict, …) as default parameter values. Check the example to understand the resulting side effects.

    class HauntedBus:
        """A bus model haunted by ghost passengers"""
        def __init__(self, passengers=[]):
            self.passengers = passengers
        def pick(self, name):
            self.passengers.append(name)
        def drop(self, name):
            self.passengers.remove(name)

    >>> bus1 = HauntedBus(['Alice', 'Bill'])
    >>> bus1.passengers
    ['Alice', 'Bill']
    >>> bus1.pick('Charlie')
    >>> bus1.drop('Alice')
    >>> bus1.passengers
    ['Bill', 'Charlie']
    >>> bus2 = HauntedBus()
    >>> bus2.pick('Carrie')
    >>> bus2.passengers
    ['Carrie']
    >>> bus3 = HauntedBus()
    >>> bus3.passengers
    ['Carrie']
    # bus3 actually shares the passengers list with bus2!!
    >>> bus3.pick('Dave')
    >>> bus2.passengers
    ['Carrie', 'Dave']
    >>> bus2.passengers is bus3.passengers
    True
    >>> bus1.passengers
    ['Bill', 'Charlie']

    The problem is that each default value is evaluated when the function is defined — i.e. usually when the module is loaded — and the default values become attributes of the function object. So if a default value is a mutable object, and you change it, the change will affect every future call of the function.

The defensive way to handle the above problem:

def __init__(self, passengers=None):
    # make the default value of passengers immutable
    if passengers is None:
        self.passengers = []
    else:
        # do not assign the passed-in argument directly to the field: `self.passengers`
        # would then be a reference to the caller's list, so we create a copy instead
        self.passengers = list(passengers)
  1. The del statement deletes names, not objects. An object may be garbage collected as a result of a del command, but only if the deleted variable held the last reference to the
    object, or if the object becomes unreachable. In CPython the primary garbage-collection algorithm is reference counting, but other Python implementations have more sophisticated garbage collectors that do not rely on reference counting. We can use weakref.finalize to register a callback function to be invoked when an object is destroyed.

  2. Weak references are references that do not increase an object's reference count. They are a low-level mechanism underlying the more useful collections WeakValueDictionary, WeakKeyDictionary and WeakSet; these, together with the finalize function from the weakref module, cover most use cases for weak references.

  3. Some strings and integers are interned: creating two variables with the same string or integer separately may leave both referencing the same object. This largely depends on the Python implementation, but be careful with common strings and integers: (1) integers in range(-5, 257); (2) often-used, single-character, and empty strings.
    (Source: Slide 22 from Wesley Chun's talk). Interning is quite common; see for example string interning.
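
A small weak-reference sketch; it relies on CPython's reference counting to collect the object as soon as the last strong reference is dropped:

```python
import weakref

class Spam:
    pass

obj = Spam()
ref = weakref.ref(obj)   # does not increase the reference count
print(ref() is obj)      # True - the target is still alive

del obj                  # drop the only strong reference
print(ref())             # None - the target was garbage collected (CPython)
```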

Pythonic object

  1. Difference between object representation: repr() and str():
  • repr():
    Return a string representing the object as the developer wants to see it.
  • str():
    Return a string representing the object as the user wants to see it.
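
A minimal sketch of the two representations (Point is an illustrative class):

```python
class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y

    def __repr__(self):
        # unambiguous; matches the source code needed to recreate the object
        return 'Point(%r, %r)' % (self.x, self.y)

    def __str__(self):
        # friendly form for end users
        return '(%s, %s)' % (self.x, self.y)

p = Point(1, 2)
print(repr(p))   # Point(1, 2)
print(str(p))    # (1, 2)
```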
  1. Difference between @classmethod and @staticmethod: the former takes the class (not the instance) as its first argument, so its behavior can depend on which class it is called on; the latter receives no class or instance argument and is mainly used for functionality related to the class but independent of any particular instance. A good reference article discusses the difference in detail.

  2. How format() works: check here for the format mini-language. We can then add custom formatting for a class by implementing the __format__ method.

  4. Name mangling: if we name an attribute inside a class with two leading underscores and at most one trailing underscore, Python stores the name in the instance __dict__ prefixed with an underscore and the class name. Note that this is about safety, not security: it is designed to prevent accidental access, not intentional wrongdoing.
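
    A quick demonstration of the mangled name (class A is a made-up example):

```python
class A:
    def __init__(self):
        self.__x = 1    # name-mangled: stored as _A__x

a = A()
assert "_A__x" in a.__dict__ and "__x" not in a.__dict__
assert a._A__x == 1     # still reachable on purpose: safety, not security
```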

  5. The single underscore prefix has no special meaning to the Python interpreter when used in attribute names, but it is a very strong convention among Python programmers that you should not access such attributes from outside the class. In modules, a single _ in front of a top-level name does have an effect: if you write from mymod import *, names with a _ prefix are not imported from mymod. You can still write from mymod import _privatefunc, though. According to this.

  6. To define __slots__ you create a class attribute with that name and assign it an iterable of str with identifiers for the instance attributes. The author prefers using a tuple for that, because it conveys the message that the __slots__ definition cannot change. This helps reduce the overhead of the per-instance __dict__. However, there are a few drawbacks:

  • You must remember to redeclare __slots__ in each subclass, since the inherited attribute is ignored by the interpreter.
  • Instances will only be able to have the attributes listed in __slots__, unless you include __dict__ in __slots__ (but doing so may negate the memory savings).
  • Instances cannot be targets of weak references unless you remember to include __weakref__ in __slots__.
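
    A minimal sketch (Pixel is a made-up class) showing both the missing __dict__ and the restriction to the listed attributes:

```python
class Pixel:
    __slots__ = ("x", "y")   # a tuple: fixed set of attributes, no __dict__

p = Pixel()
p.x, p.y = 1, 2
assert not hasattr(p, "__dict__")   # no per-instance dict overhead

try:
    p.color = "red"                 # not listed in __slots__
except AttributeError:
    pass
else:
    raise AssertionError("expected AttributeError")
```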

Sequence hacking, hashing and slicing

  1. repr() is conventionally used for debugging output, so the representation of a long container should be abbreviated. We should use reprlib (which shows at most 6 list elements by default) to limit the printed length for a customized class.
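
    For example, with the default limits of the reprlib module:

```python
import reprlib

# reprlib.repr abbreviates long containers (at most 6 list elements by default).
assert reprlib.repr(list(range(100))) == "[0, 1, 2, 3, 4, 5, ...]"
assert reprlib.repr(list(range(3))) == "[0, 1, 2]"   # short ones are untouched
```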

  2. How does slicing work? The slice notation is turned into a slice object (or a tuple of them) before being passed to __getitem__:

    >>> class MySeq:
    ...     def __getitem__(self, index):
    ...         return index
    ...
    >>> s = MySeq()
    >>> s[1]
    1
    >>> s[1:4]
    slice(1, 4, None)
    >>> s[1:4:2]
    slice(1, 4, 2)
    >>> s[1:4:2, 9]
    (slice(1, 4, 2), 9)
    >>> s[1:4:2, 7:9]
    (slice(1, 4, 2), slice(7, 9, None))

    To handle slicing properly, the __getitem__ of a customized class should separate the logic for slice objects from that for integral (numbers.Integral) indices.
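
    A sketch of that separation, using a hypothetical Vector class that slices back into its own type rather than returning a plain list:

```python
import numbers

class Vector:
    """Hypothetical sequence that slices back into its own type."""

    def __init__(self, components):
        self._components = list(components)

    def __getitem__(self, index):
        if isinstance(index, slice):
            return type(self)(self._components[index])  # new Vector, not a list
        if isinstance(index, numbers.Integral):
            return self._components[index]
        raise TypeError(f"indices must be integers or slices, not {type(index).__name__}")

v = Vector([10, 20, 30, 40])
assert v[1] == 20
assert isinstance(v[1:3], Vector)
assert v[1:3]._components == [20, 30]
```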

  3. __getattr__ is invoked when the normal lookup of an attribute fails, so dynamic attributes can be created with specific logic in this function; remember to raise AttributeError for unsupported names. __setattr__ is the corresponding setter and can be used to make attributes read-only, which is pretty useful for hashable objects.
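
    One sketch of the read-only idea via __setattr__ (Frozen is a made-up class; __init__ bypasses the override by calling the superclass setter):

```python
class Frozen:
    """Hypothetical read-only holder: attributes are set once in __init__."""

    def __init__(self, value):
        super().__setattr__("value", value)   # bypass our own __setattr__

    def __setattr__(self, name, _):
        raise AttributeError(f"read-only attribute {name!r}")

f = Frozen(42)
assert f.value == 42
try:
    f.value = 0
except AttributeError:
    pass
else:
    raise AssertionError("expected AttributeError")
```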

  4. When using reduce it’s good practice to provide the third argument,
    reduce(function, iterable, initializer), to prevent this
    exception: TypeError: reduce() of empty sequence with no initial value.
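
    For example, summing with operator.add:

```python
from functools import reduce
from operator import add

assert reduce(add, [1, 2, 3], 0) == 6
assert reduce(add, [], 0) == 0   # the initializer makes the empty case safe

try:
    reduce(add, [])              # no initializer: empty input raises
except TypeError:
    pass
else:
    raise AssertionError("expected TypeError")
```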

Interfaces: from protocols to ABCs

  1. Warning of over engineering: ABCs, like descriptors and metaclasses, are tools for building frameworks.
    Therefore, only a very small minority of Python developers
    can create ABCs without imposing unreasonable limitations and
    needless work on fellow programmers.

  2. Every class has an interface: the set of public attributes (methods or data attributes) implemented or inherited by the class, for example
    __getitem__ or __add__.

  3. Monkey patching is changing a class or module at runtime without touching the source code.
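
    A tiny made-up example, rebinding a method on a class at runtime:

```python
class Greeter:
    def greet(self):
        return "hello"

def excited_greet(self):
    return "hello!!!"

# Monkey patch: replace the method on the class, no source change needed.
Greeter.greet = excited_greet
assert Greeter().greet() == "hello!!!"
```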

Other information

  1. Python in the book: CPython 3.4

  2. Github: fluentpython

  3. Useful visualization of Python code running: pythontutor

  4. Use the dir function to list the attributes of an object.

  5. Use the dis module to get the bytecode of a Python function, which can be helpful when comparing subtly different implementations.
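
    A quick sketch with dis.Bytecode; note the exact opcodes vary by Python version, so the check below is deliberately loose:

```python
import dis

def add(a, b):
    return a + b

# Collect the opcode names instead of printing the disassembly.
ops = [instr.opname for instr in dis.Bytecode(add)]
# BINARY_ADD before Python 3.11, BINARY_OP from 3.11 onward.
assert "BINARY_ADD" in ops or "BINARY_OP" in ops
```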

After going through the basic bash commands in the previous class, we will explore putting all sorts of commands together into a bash script file.

Brief

There are a bunch of must-knows for writing the scripts introduced in the class. Here is the complete checklist to make sure you understand them all:

  • variable assignment (NO SPACES!)
  • difference between strings enclosed with ' and "
  • control flows: if, while, case, for
  • function definition, inputs, return value and some special variables.
  • && and || operators
  • get the outputs from a command
  • testing conditions, also mentioned in my last post.
  • wildcards (foo*) and curly-brace expansion (foo.{jpg,png}, foo{a..h}, foo{1..100})
  • shellcheck
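
A small hypothetical script tying most of these together (all names are made up; run it through shellcheck for good measure):

```shell
#!/usr/bin/env bash

name="world"                 # variable assignment: no spaces around =
echo "hello, $name"          # double quotes expand variables
echo 'hello, $name'          # single quotes print the text literally

greet() {                    # arguments arrive as $1, $2, ...
    if [[ -n "$1" ]]; then
        echo "hi, $1"
        return 0
    fi
    return 1
}

greet "bash" && echo "greeted" || echo "no name given"

today=$(date +%Y-%m-%d)      # capture a command's output
echo "today is $today"

for f in file{1..3}.txt; do  # brace expansion generates the names
    echo "would process $f"
done
```
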
Read more »

The first class introduces the terminal shell, focusing mainly on Bash since it is the default on most platforms.

Brief

It is perfect for beginners without any knowledge of the shell. Starting from terminal basics and moving on to basic Bash commands such as echo and ls, the class provides enough information for anyone to start exploring the shell (I guess man is enough to start …).

After going through the video and texts, we still need to figure out more commands in order to grasp the basics of the terminal. I guess the most important lesson for me is to search only when you need it. There are tons of commands to know, and each command has tons of options to explore. It would be extremely boring to learn by reading an A-Z exhaustive reference of shell commands. If you want to know how to find a file, you learn find; if you need to rename a batch of files, you get to know the for loop and probably sed.

Another thing that might be helpful: there are always many ways to accomplish one goal. If you don’t like the top answer on StackOverflow, maybe the second one or some of the comments will suit you better.

Read more »

The Missing Semester of your CS degree is actually the missing one from my CS coursework. It is a good chance for anyone with a CS degree to brush up on the tools behind computer science, and absolutely the perfect course for students who are learning CS now.

Here is the course structure; it covers a bunch of useful tools that are indeed beneficial to our development journey.

Course Structure

  • Shell
  • Vim
  • Data manipulation
  • Command-line environment
  • Git
  • Debugging, logging, profiling
  • Makefiles
  • Security
  • Miscellaneous
Read more »