
This post introduces a new data structure - the vector map - which solves some issues related to storing collections in MVCC data stores. Further, vector maps have some super nice use cases for "occasionally connected" systems.

The idea warrants a more rigorous discourse, but I need to get it off my chest, so here is a blog entry describing it.

Modern distributed data stores such as CouchDB and Riak use variants of Multi-Version Concurrency Control (MVCC) to detect conflicting database updates and present them as multi-valued responses.

So, if my buddy Ola and I both update the same data record concurrently, the result may be that the data record now has multiple values - both mine and Ola's - and it will be up to the eventual consumer of the data record to resolve the problem. The exact schemes used to manage the MVCC differ from system to system, but the effect is the same: the client is left with the turd to sort out.

This led me to the idea of creating a data structure which is, by its very definition, able to be merged, and then storing such data in these kinds of databases. So, if you are handed two versions, there is a reconciliation function that will take those two records and "merge" them into one sound record, by some definition of "sound".

From what I have seen, the "thing" stored is often itself a collection, like a list or a hash map. Say that Ola and I both add new elements to the collection and store the results; the resulting multiple records are then - with proper definitions - naturally mergeable, namely into the list or map that contains the original entries plus both mine and Ola's.

So, this is the presentation of my idea: a vector map, a data structure designed to be used in this context. It also has other interesting applications, as we shall discuss towards the end of this post.

Vector Maps

A vector map is defined as a set of assignment events, Key=Value, each such event being time stamped with a vector clock (hence the name). From a high-level point of view, a vector map can be seen as a hash table, i.e., a collection of key/value pairs.

Two vector maps can be reconciled (VectorMap1 ⊕ VectorMap2), so that for each key, the "most recent assignment event" wins. If assignment events are in conflict (vector-clock wise concurrent), then the resulting value is multivalued.

The reconciliation function ⊕ for vector maps is defined so that it is commutative, i.e., it can be applied in any order:

A ⊕ B = B ⊕ A

It is also associative,

A ⊕ (B ⊕ C) = (A ⊕ B) ⊕ C,

which means that any reordering done inside the database store is insignificant to the resulting usage.
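
As a concrete consequence, once we have a reconcile function implementing ⊕ (a sketch appears further down in this post), it does not matter in which order or grouping a set of assignment events is folded together:

    // Folding assignment events a, b and c in either grouping yields an
    // equivalent result: the same lub'ed vector clock and the same value set.
    Assignment left  = reconcile(reconcile(a, b), c);
    Assignment right = reconcile(a, reconcile(b, c));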

In fact, Riak itself (the distributed key/value store) is just a big, distributed and redundant version of this, which just goes to prove the versatility of the idea. So in a sense, a vector map is just a recursion over the Riak (Dynamo) concepts, applied to a data structure.

In the following we will define things a little more rigorously and provide some examples; and towards the end there is a discussion of how vector maps can be used.

Assignments as Vector Clocked Events

For the purposes of this discourse, we will model an assignment event as

  • a Key,
  • a Set of values, and
  • a vector clock, VC, describing when the assignment happened,

so an assignment event has the following form:

Assignment :: VC: Key = [Value, Value, ...]

We will let the set contain multiple values in case a conflict has been observed, but to start off with our sets will be single-valued.
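
To make this concrete, here is a minimal sketch of an assignment event in Java (one of the implementation languages mentioned towards the end of this post). Representing the vector clock as a plain map from agent name to time stamp is an assumption made for the sake of the example:

    import java.util.Map;
    import java.util.Set;

    // Minimal sketch of an assignment event; the vector clock is assumed to be
    // a map from agent name to that agent's local time stamp.
    final class Assignment {
        final Map<String, Long> vclock;  // e.g. {krab=1} or {ola=2, krab=1}
        final String key;                // e.g. "X"
        final Set<Object> values;        // multi-valued only if a conflict was observed

        Assignment(Map<String, Long> vclock, String key, Set<Object> values) {
            this.vclock = vclock;
            this.key = key;
            this.values = values;
        }
    }

    // A vector map is then essentially a map from key to the latest
    // (reconciled) assignment event for that key.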

Our system will use the vector clock to determine "who wins". Let's see what happens ...

To begin with, I do an assignment "X" = 4, and record this event:

Assignment1 = (krab:1) : "X" = [4]

Reading: at krab's time 1, "X" is bound to 4.

Later, after observing my assignment, a colleague Ola, at his time 2, defines "X" to be 5:

Assignment2 = (ola:2,krab:1) : "X" = [5]

Now, the immediate beauty is that because each assignment is time stamped with a vector clock, we can easily determine that Assignment2 happened after Assignment1 (the vector clock says that Ola did indeed observe Assignment1 before creating his own), and so if we see both we can discard the earlier one without loss.
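
The ordering check itself is cheap. A sketch, assuming the map-based vector clock representation from above:

    import java.util.Map;

    final class VClocks {
        // VC1 ≤ VC2 when every agent's time stamp in VC1 is also covered by VC2;
        // an agent missing from a clock counts as time 0.
        static boolean lessOrEqual(Map<String, Long> vc1, Map<String, Long> vc2) {
            for (Map.Entry<String, Long> e : vc1.entrySet()) {
                if (e.getValue() > vc2.getOrDefault(e.getKey(), 0L)) return false;
            }
            return true;
        }
    }

    // From the example: lessOrEqual({krab=1}, {ola=2, krab=1}) is true, so
    // Assignment1 can safely be discarded once Assignment2 has been seen.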

Conflict happens

Now, what happens if Jens comes in and, based on observing only my original assignment, reassigns "X" to be 7? We'll describe this with the following event:

Assignment3 = (jens:3,krab:1) : "X" = [7]

To Jens, this is not problematic, but anyone who observes both Assignment2 and Assignment3 will know that Jens did not observe Ola's assignment at time 2.

To make some kind of sense out of this, we define the reconciliation operator ⊕, describing the aggregation of knowledge we have when combining the known events.

Assignment2 ⊕ Assignment3 = (krab:1,ola:2,jens:3) : "X" = [5,7]

I.e., we describe that "X" has conflicting values 5 and 7 at a point in time which is after both Assignment2 and Assignment3.
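
In code, detecting such a conflict is just the ordering check from the earlier sketch applied in both directions; the following method could live next to lessOrEqual:

    // Two clocks are concurrent - and thus the assignments conflict - exactly
    // when neither is less than or equal to the other.
    static boolean concurrent(Map<String, Long> vc1, Map<String, Long> vc2) {
        return !lessOrEqual(vc1, vc2) && !lessOrEqual(vc2, vc1);
    }

    // Assignment2 carries (ola:2,krab:1) and Assignment3 carries (jens:3,krab:1);
    // neither dominates the other, so both values 5 and 7 must be kept.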

Deleting values

A further complication is what happens when we want to delete a binding. This is handled quite simply by making a new assignment to a unique value which is somehow outside the scope of other possible values, and then reconciling that into the current event set. This does have some interesting properties, because there is an observable difference between a binding that was never there and a binding that was removed.
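
As a sketch (reusing the Assignment class from above, and assuming a single shared tombstone object; the "deleted" flag in the wire format towards the end of this post plays a similar role), deletion could look like this:

    // Deletion is just another assignment whose single value is a distinguished
    // tombstone that no ordinary value can collide with.
    static final Object TOMBSTONE = new Object();

    static Assignment delete(Map<String, Long> vclockOfDelete, String key) {
        return new Assignment(vclockOfDelete, key, java.util.Set.of(TOMBSTONE));
    }

    // A reader that sees the tombstone knows the binding existed and was removed,
    // which is observably different from a key that was never bound at all.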

Defining Reconciliation

In the example above, we can see from the vector clocks that Jens had not observed Ola's assignment, and so we reconciled the two conflicting events by

  1. creating an artificial vector clock (krab:1,ola:2,jens:3) which is logically after both (ola:2,krab:1) and (jens:3,krab:1), and
  2. combining the bound values [5] and [7] into a set of values [5,7].

Had it been obvious that one assignment happened before the other, we would simply have thrown one of them away; but because the two vector clocks were in conflict, we have to recognize the conflict.

So the actual definition of reconciliation for two assignment events with the same key is as follows:

VC1 : Key = Set1  ⊕  VC2 : Key = Set2  =
        VC1 ≤ VC2 → VC2 : Key = Set2;
        VC2 ≤ VC1 → VC1 : Key = Set1;
        otherwise → lub(VC1, VC2) : Key = (Set1 ∪ Set2)

The reconciliation operator ⊕ (o-plus) is commutative (just like good old addition), so we can use it to reconcile assignment events in any order and we'll always arrive at the same result in the end. This makes it perfect for making decisions in a distributed system, because it means that even if we get to know about events in a different order, the state will eventually reconcile to the same value ("eventually" meaning when we've seen all the events).
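
A sketch of this definition in Java, using lessOrEqual from the earlier sketch and the lub function sketched below:

    // The o-plus operator for two assignment events with the same key,
    // following the three cases of the definition above.
    static Assignment reconcile(Assignment a1, Assignment a2) {
        if (lessOrEqual(a1.vclock, a2.vclock)) return a2;   // a2 is at least as new
        if (lessOrEqual(a2.vclock, a1.vclock)) return a1;   // a1 is newer
        java.util.Set<Object> union = new java.util.HashSet<>(a1.values);
        union.addAll(a2.values);                            // concurrent: keep both
        return new Assignment(lub(a1.vclock, a2.vclock), a1.key, union);
    }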

The definition uses lub (least upper bound) on two vector clocks, which combines them by taking the maximum local time stamp for each agent present in either of the two vector clocks:

lub( VC1, VC2 ) =
        ∀ a ∈ agents(VC1) ∪ agents(VC2) :
            a : max( time(VC1,a), time(VC2,a) )

Where time(VC,a) is 0 (zero) if VC does not list an agent a. For example

lub( (b:3), (a:1, b:2) ) = (a:1, b:3)
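
A corresponding sketch in Java, written as another method that could sit next to lessOrEqual, again assuming vector clocks are maps from agent name to time stamp:

    // Least upper bound of two vector clocks: for every agent mentioned in
    // either clock, take the larger of its two time stamps (missing = 0).
    static Map<String, Long> lub(Map<String, Long> vc1, Map<String, Long> vc2) {
        Map<String, Long> result = new java.util.HashMap<>(vc1);
        for (Map.Entry<String, Long> e : vc2.entrySet()) {
            result.merge(e.getKey(), e.getValue(), Math::max);
        }
        return result;
    }

    // lub({b=3}, {a=1, b=2}) -> {a=1, b=3}, matching the example above.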

What is this good for?

You may rightly say that this doesn't solve the problem, it just pushes it one level down. And that's right; but it does solve an interesting range of problems, and with some care you can often structure your usage of keys in vector maps so that you can avoid conflicts altogether.

Further, since your favorite data store already does this for you ... you may say that you can just split your map into individual key/value bindings and store those in the database. But that comes at the price of extra network round trips and lost data locality.

Use case: Modeling Relationships in MVCC data stores

Vector maps are really nice for modeling relationships inside MVCC databases.

Assuming you want to store a one-to-many relationship in Riak (say, order - order-item), you run into the problem that the "one" side is likely to be updated concurrently when multiple items are added to the same order.

With vector maps, you can easily model the entire relationship as one order object which is just a vector map, where each item is stored with a distinct key. If that key is e.g. a sufficiently large random number, making it very unlikely that order-item IDs conflict, then you're pretty much home free. Alternatively, you can devise a mechanism so that each client of the system is able to construct globally unique keys (agent + sequence number).
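
As a hypothetical illustration of the second option (the key format here is made up for the example):

    // Each client derives keys from its own agent name plus a local sequence
    // number, so concurrent additions by different clients never assign to the
    // same key, and the reconciled order simply contains all the items.
    static String newItemKey(String agentName, long localSequence) {
        return "item:" + agentName + ":" + localSequence;
    }

    // e.g. "ola" stores a new line item under "item:ola:1" while "krab" stores
    // one under "item:krab:1"; reconciling the two versions keeps both entries.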

For this use case you'll also see much improved performance, because of the improved locality of reference. If your data store needs to go and fetch each individual order-item from somewhere on disk, performance will be seriously hampered.

Use case: off-line data

Vector maps are great for off-line data, because they give a well-defined meaning to the concept of synchronization (something I would really like my iCal to do :-) ). Synchronization is simply defined as the exchange of vector maps, storing the result of the reconciliation on both sides.

Such synchronization can happen in near-real time (peer-to-peer update) or as a delayed synchronization whenever there is contact to a server/peer.

This is perhaps the most interesting use case, because it can be used as a simple foundation for making data available to e.g. mobile clients in an "occasionally connected" system, in a way that makes sense for both online and offline mode.
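
A sketch of what that synchronization step might look like; VectorMapStore, VectorMap and reconcileMaps are hypothetical names, with reconcileMaps assumed to apply the per-key ⊕ operator to every key present in either map:

    // Synchronize one object between two replicas: exchange vector maps and
    // let both sides keep the reconciled result.
    static void synchronize(VectorMapStore local, VectorMapStore remote, String id) {
        VectorMap mine   = local.get(id);
        VectorMap theirs = remote.get(id);
        VectorMap merged = reconcileMaps(mine, theirs);
        local.put(id, merged);
        remote.put(id, merged);
    }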

Implementation Issues

Since it is a little complicated to manipulate a vector map, we need implementations in the most common languages out there to get it off the ground. I'm currently hacking on an Erlang and a Java version.

Working on this, I have come to the conclusion that it would be great if vector maps had a well-defined binary representation so that they could be meaningfully manipulated in a number of different contexts, easily stored and transmitted, etc. - in other words, a special MIME type that lets multiple parties consume vector maps independently of programming language.

If vector maps had their own MIME type and a well-defined data representation, data stores such as Riak or CouchDB could even do the reconciliation automatically before serving the data to a client.

So, right now I am working with a protocol buffers definition that looks like this on the wire, to be served under a dedicated Content-Type, and likely also a JSON representation.

    message VectorMap {
        repeated Entry entries = 1;
    }

    message Entry {
        required string key     = 1;
        repeated Clock  vclocks = 2;
        repeated Value  value   = 3;
    }

    message Value {
        optional string mime_type = 1 [ default = "application/json;charset=utf-8" ];
        optional bytes  content   = 2;
        optional bool   deleted   = 3 [ default = false ];
    }

    message Clock {
        required string node       = 1;
        required uint32 counter    = 2;
        required uint64 utc_millis = 3;
    }

Conclusions

The Dynamo idea of using vector clocks to time stamp data is great, but I think the power of the idea goes quite a bit further if the logic is brought all the way to the client.

CouchDB tries to do this by suggesting that client devices (mobile devices) should have a full-fledged CouchDB running locally. But I think that exposing data this way makes the mobile client model much more manageable. This entire idea can be wrapped up in a single Java class which is easily deployed in an Android app; and it is sufficiently simple to be implementable in a range of programming languages.

What do you think?

