Thursday, June 28, 2007

Web3S pushes Secondary Resources

Summary: Web3S doesn't naturally support RESTful primary resource identifiers for EIIs. It can, but only with optional elements and server-specific guarantees about identifiers. A generic Web3S client could never "know" what canonical and primary resources were identified.

Yaron responded to my Web3S posts on identity in hierarchies and graph serialization support. (Yaron's comment is on the second link).

My original concern was that Web3S duplicates the values of a single resource when serializing a tree, and that this obscures the identity of resources.

Now I realize the issue isn't just that identity is obscured: Web3S has no standard way to expose the primary resources in a system.

The properties of a Web3S:ID are
  • unique only within the containing element,
  • used to generate URIs, but those URIs are path-relative to the containing elements
Then, from the Web3S spec, section 7: The author elements have a single URI, but that is both path (within article) dependent and based on a locally unique object identifier.

This URI (a non-prefixed, short version from the example in section 7):
http://example.net/stuff/morestuff/articles/article(8383)/authors/author(23455)
doesn't provide a primary resource identifier for the author. Why? This path dependent URI is hardly different than the same URI with a fragment identifier (for a secondary resource):
http://example.net/stuff/morestuff/articles/article(8383)#author23455
From Architecture of the World Wide Web, section 2.6 Fragment Identifiers:
The fragment identifier component of a URI allows indirect identification of a secondary resource by reference to a primary resource and additional identifying information. The secondary resource may be some portion or subset of the primary resource, some view on representations of the primary resource, or some other resource defined or described by those representations. The terms "primary resource" and "secondary resource" are defined in section 3.5 of [URI].
The author in all of these examples should be (able to be) identified as a primary resource, not just a secondary one. Whether authors are primary or secondary resources is really a question of server implementation. The Web3S spec, though, makes that decision by default for all services.
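To make the comparison concrete, here is a minimal Java sketch using the two example URIs above. The class and method names (ResourceIds, primaryOf, isNestedUnder) are my own, made up for illustration; nothing here comes from the Web3S spec.

```java
import java.net.URI;

final class ResourceIds {
    // Strips the fragment, leaving the primary resource that a
    // fragment-bearing URI identifies a secondary resource through.
    static URI primaryOf(URI uri) {
        String s = uri.toString();
        int hash = s.indexOf('#');
        return URI.create(hash < 0 ? s : s.substring(0, hash));
    }

    // True when child's path sits below parent's path on the same host,
    // i.e. the child identifier only makes sense relative to the parent.
    static boolean isNestedUnder(URI child, URI parent) {
        return child.getHost().equals(parent.getHost())
                && child.getPath().startsWith(parent.getPath() + "/");
    }
}
```

Both forms lead back to the same article: primaryOf on the fragment URI yields the article URI, and isNestedUnder reports that the path-style author URI lives below that same article. In neither case does the author get an identifier that stands apart from the article.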

Here is the rest of Yaron's comment, with my additional thinking added in.

Yaron writes: Web3S’s infoset is just the core set of primitive data containers, they don’t define the data models that sit on top. So, for example, one could easily define a graph based data model on top of the tree based infoset that used links to say things like “These two things are the same”. In fact section 10.1 and 10.2 of the Web3S spec define a standard HREF style element for exactly this reason.
True enough, it could be done. However, the Web3S infoset does explicitly define a notion of identity, Web3S:ID, that doesn't provide a way to express those things.

The definition you mentioned of HREF elements makes no statement about what resource is identified, so no client can assume anything about it.
Yaron writes: By Value – If the canonical author entry and the references to author all exist within the same ‘system’ (I’m being vague intentionally but think of examples like a single DB) then likely would one just use by value. The author values would show up where needed and changing one author value would change the other. Astoria does this today but they add the additional guarantee that if two instances of a particular element (e.g. author) are in fact the same underlying object then the ID will be the same. That is completely legal in Web3S. Web3S just says that the caller can only assume that IDs are locally unique. But the server is free to offer a higher guarantee if it wants and then advertise that fact. Heck a server could choose to give every element instance a GUID/UUID and so guarantee global uniqueness.
All based on optional elements and out of band (published schema) communication.
  1. I can't write a general purpose Web3S tool that takes advantage of that, and
  2. Tools that do take advantage of that are more tightly coupled to that one service.
Yaron writes: Also, for whatever it’s worth, Astoria supports both hard and soft linking. Our current thinking about this in Web3S is that we would allow servers to advertise schemas that define object relationships, explain ID guarantees, specify hard versus soft linking, etc. We will also probably provide mechanisms to allow servers to annotate data with this information directly rather than requiring schema look up but given the bandwidth expense I’m not sure how often we would use this.
I hope you would use it all the time, otherwise you will be promoting a code generation solution with published Schemas. See my own blog posts, as well as the recent storm of discourse on WADL and REST.

I'm not exactly sure what hard and soft linking refer to here, can you expand? I don't have experience with Astoria yet, and I'm thinking inode filesystem hard links...
Yaron writes: By Reference – Alternatively there would be a single canonical author entry and anyone who wanted to refer to that entry would just use a URL ala section 10.1.
This should be the default for how Web3S works: the identifier can be a primary or secondary resource identifier in all cases. It should be absolutely server-dependent whether authors can be independently identifiable, but providing canonical and absolute identity for any EII should not require client-server coupling.

I'll think more about how to achieve this, but a first random idea is to allow the HREF element inside (and in place of) the Web3S:ID element.
Yaron writes: In either case you can get there from here.
Yes, you can, but the default, standard, and primary means of identification should directly support a canonical identifier system without resorting to optional elements and shared schemas.

Tuesday, June 26, 2007

Distributed Systems and Consensus

Mark Mc Keown has posted a fantastic summary of "Consensus, 2PC, and Transaction Commit" over the last decades.

I've read some of those materials before, but had certainly never ordered everything so clearly.

I think this is a particularly important reference:
Fischer, Lynch and Paterson showed that distributed consensus was impossible in an asynchronous system with just one faulty process in "Impossibility of distributed consensus with one faulty process" (1985), this famous result is known as the "FLP" result.
In particular, this result means that any system that is distributed will have to deal with failures somehow.

Erlang has a built in model for handling distributed failures.... more reading to do.

Going faster by duplicating work

Dare has posted some excellent summaries of Google's Scalability conference. (I can't wait till the videos are online!)

I am always entertained by counterintuitive results, and the solution to stragglers in MapReduce is no exception.

Google had an issue with "stragglers" performing tasks very slowly in the MapReduce infrastructure. The solution: duplicate the same tasks on multiple machines and throw away the redundant results. Go faster by doing more work! ;)
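As a rough single-JVM analogy (not Google's implementation, which I haven't seen), the same trick can be sketched with java.util.concurrent: submit identical attempts and keep whichever finishes first. BackupTasks and runWithBackups are names I made up for this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class BackupTasks {
    // Runs `copies` identical attempts of the task and returns the result of
    // the first attempt to complete successfully; the others are cancelled.
    static <T> T runWithBackups(Callable<T> task, int copies) {
        ExecutorService pool = Executors.newFixedThreadPool(copies);
        try {
            List<Callable<T>> attempts = new ArrayList<>();
            for (int i = 0; i < copies; i++) {
                attempts.add(task);
            }
            return pool.invokeAny(attempts);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("Every attempt failed", e);
        } finally {
            pool.shutdownNow(); // interrupt any straggler still running
        }
    }
}
```

This only makes sense when the task is safe to run more than once, since the redundant attempts may have done part of their work before being thrown away.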

Related idea: set-based design. Optimize the throughput of the entire system, instead of each part.

Thoughts, techniques, and references to achieving scalability

I'm going to use this label, scalability, to collect my thoughts and random ideas for how to make systems scale.

I'm going to start with a list of links that I recognize have been formative to my understanding of scalability.

SEDA: An Architecture for Highly Concurrent Server Applications and then C10K
  • Reading the SEDA thesis and then the C10K site helped me understand that simple imperative programming might not always scale...

REST: Architectural Styles and the Design of Network-based Software Architectures
Life Beyond Distributed Transactions

  • Most recently this paper by Pat Helland influenced my thinking: Transactions can only bound a single entity, but that entity can contain various pieces of data including historical communication entries to help support idempotent messaging.
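Here is how I picture that last idea, as a minimal sketch rather than anything taken from the paper: the entity keeps a record of the messages it has already handled, so a redelivered message changes nothing. The Account entity and its deposit message are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

final class Account {
    private long balance;
    // History of handled messages, stored as part of the entity itself so it
    // would be updated in the same (single-entity) transaction as the balance.
    private final Set<String> processedMessageIds = new HashSet<>();

    // Applies the deposit at most once per message id.
    // Returns true if applied, false if the message was a duplicate.
    synchronized boolean deposit(String messageId, long amount) {
        if (!processedMessageIds.add(messageId)) {
            return false;
        }
        balance += amount;
        return true;
    }

    synchronized long balance() {
        return balance;
    }
}
```

A real system would also need some way to bound or expire that history; this sketch just lets it grow.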

Monday, June 18, 2007

Graph based serialization examples

My previous post on Web3S hierarchical data vs. graph data didn't provide any solutions.

The explanation in the FAQ about graphs isn't sitting well with me. There is a simple way to represent graph data in serialized formats: name the data the first time it's seen and refer to it subsequent times. Examples include:
This is not a complicated pattern to serialize or write code against.
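To back up that claim, here is a minimal Java sketch of the pattern: emit the full element and an id the first time an object is seen, and only a reference afterwards. The id/idref attribute names and the GraphWriter and Author classes are made up for illustration, and the sketch skips XML escaping.

```java
import java.util.IdentityHashMap;
import java.util.Map;

final class GraphWriter {
    static final class Author {
        final String name;
        Author(String name) { this.name = name; }
    }

    // Objects already written, keyed by object identity (not by value).
    private final Map<Author, Integer> seen = new IdentityHashMap<>();
    private int nextId = 1;

    // First sighting: full element with a fresh id. Later sightings: a reference.
    String write(Author author) {
        Integer id = seen.get(author);
        if (id != null) {
            return "<author idref=\"" + id + "\"/>";
        }
        id = nextId++;
        seen.put(author, id);
        return "<author id=\"" + id + "\"><name>" + author.name + "</name></author>";
    }
}
```

Reading is the mirror image: keep a map from id to object and look the object up whenever a reference appears.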

From Dare's posted example, we could add another article with an existing author like this (notice the Web3S:IDREF element):
<articles>
<article>
<Web3S:ID>8383</Web3S:ID>
<title>Manual of Surgery Volume First: General Surgery. Sixth Edition.</title>
<authors>
<author>
<Web3S:ID>23455</Web3S:ID>
<firstname>Alexander</firstname>
<lastname>Miles</lastname>
</author>
<author>
<Web3S:ID>88828</Web3S:ID>
<firstname>Alexis</firstname>
<lastname>Thomson</lastname>
</author>
</authors>
</article>
<article>
<Web3S:ID>8384</Web3S:ID>
<title>...</title>
<authors>
<author>
<Web3S:IDREF>88828</Web3S:IDREF>
</author>
</authors>
</article>
</articles>
But that won't work in Web3S without many more changes. The problem here is identity, not data formats.

Web3S supports hierarchies? Useful ones?

Update: I've added more detailed links and summaries of how this could fail in my next blog post.

Web3S seems clearly to be driven by a need for hierarchies, judging from the discussions (Dave, Tim, Sam), the examples about books and authors, and the FAQ on why not APP:
No Hierarchy – ATOM only supports a two level hierarchy, feed and entry. It is possible, of course, to create an entry that is really a pointer to another feed but that is both painful to handle at a protocol level and inefficient when one actually wants to retrieve an entire tree as one has to make many round trips to pull in all the values as one walks the feeds.
But, later in the FAQ on hierarchies and graphs:
... we will just have to make do with hierarchical rather than graph based data formats.
So, Web3S supports hierarchies, but not graphs. Isn't the Web a graph? What does Web3S do when two articles have the same author?

My assumption was that the author element would be repeated with the same id, so I looked further and found the FAQ on ID uniqueness where we learn that the IDs are only unique within the containing element.

Am I understanding this right? Does Web3S really not have support for maintaining a single author of many articles? Do we really need to track down every occurrence of one author (by data matching instead of URI) in order to correct a spelling error?

I've not done more homework on this than the FAQ, I'll keep looking. Someone please tell me I'm wrong because this would be silly.

Wednesday, June 13, 2007

Web Sites, Web Applications, and Content Types

I couldn't find this reference when I was looking for it the other day.

Good Web APIs are just Web Sites

Excellent!

Both that presentation and an RDF Shopping example use content negotiation to serve different content types to clients. My previous signposts entry used a pretend HTML microformat. Different strategies for transferring understandable content from server to client.

The question I'm still asking myself is which of these two choices is better, and why:
  1. Separate and negotiated media types
  2. Single extensible media type
I have to admit, I find conneg to be non-Visible. I just don't feel very comfortable with it.

Candidates for extensible media types are HTML microformats and RDF (and Topic Maps and DITA and SGML Architectures).

The HTML Web has flourished (in part) because of Postel's Law:
"Be liberal in what you accept, and conservative in what you send."
Is that enough to encourage using fewer, but extensible, media types over individual crafted media types?

Monday, June 11, 2007

Java ClassLoader trick

I work on the Glassbox Java Troubleshooting Agent, and it uses Ant scripts to automate installation into various Java containers. The version of Ant we use was conflicting with existing libraries (in CruiseControl and JBoss), so I needed to create a ClassLoader sandbox.

The OverridingURLClassLoader.java does the trick. This is just a URLClassLoader with one additional argument: the name of the class to redefine. This ClassLoader will enable one of your application classes to be re-defined in a ClassLoader sandbox and avoid Jar hell when deploying into any random environment.

That one application class could be defined by both the application ClassLoader and this new ClassLoader, and those two classes won't be compatible or assignable. An interface or base class could be shared by both though (and only defined by the application ClassLoader). The AntInstaller.java is using an interface to type the returned newInstance() object.

Here is the constructor that takes the name of a single class that should be re-defined by this ClassLoader (instead of the parent ClassLoader):

class OverridingURLClassLoader extends URLClassLoader {

    String redefineClassName;

    public OverridingURLClassLoader(String redefineClassName, URL[] urls, ClassLoader parentLoader) {
        super(urls, parentLoader);
        this.redefineClassName = redefineClassName;
    }


The loadClass method checks cached classes, then classes defined in the list of URLs. If that fails, then before the parent is called, a check for if (name.equals(this.redefineClassName)) is done. If that class is being asked for, then it is re-defined using this instance of a ClassLoader:

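    public Class<?> loadClass(String name) throws ClassNotFoundException {
        // 1. Reuse a class this loader has already defined.
        Class<?> c = findLoadedClass(name);
        if (c == null) {
            try {
                // 2. Look in this loader's own URLs before asking the parent.
                c = findClass(name);
            } catch (ClassNotFoundException e) {
                if (name.equals(this.redefineClassName)) {
                    // 3. Read the class bytes through the parent, but define the class here.
                    String path = name.replace('.', '/').concat(".class");
                    URL resource = getParent().getResource(path);
                    if (resource == null) {
                        throw new ClassNotFoundException(name);
                    }
                    try {
                        byte[] bytes = toBytes(resource.openStream());
                        c = defineClass(name, bytes, 0, bytes.length);
                    } catch (IOException e1) {
                        throw new IllegalStateException("Can't get class definition", e1);
                    }
                } else {
                    // 4. Everything else is delegated to the parent.
                    c = getParent().loadClass(name);
                }
            }
        }
        return c;
    }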
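The toBytes helper called in loadClass isn't shown here. A minimal sketch, assuming all it needs to do is drain the stream into a byte array and close it (the StreamBytes holder class is only there to make the snippet stand alone; in the real class it would be a method on the ClassLoader):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

final class StreamBytes {
    // Reads the stream to the end, closes it, and returns everything that was read.
    static byte[] toBytes(InputStream in) throws IOException {
        try (InputStream input = in) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            int read;
            while ((read = input.read(buffer)) != -1) {
                out.write(buffer, 0, read);
            }
            return out.toByteArray();
        }
    }
}
```

The IOException it declares is the one the catch block in loadClass turns into an IllegalStateException.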

Wednesday, June 6, 2007

Why, What, How and programming.

I wanted to record this thought. I was working on some code that set a string value into a java.util.Map that was passed to Ant to do something else that resulted in a startup parameter for another system that.......

Hmm. Let's just say it was a little obtuse.

I've long been a believer and student of "What, not How" when it comes to requirements, design, and so on.

On this particular occasion, I realized the "Why" was really the most important, and missing!

So I now think we should:
  • Document "Why, not What"
  • Design for "What, not How"
  • and hope no one ever looks at "How" (along with sausage and law)

D'oh! REST already had contracts.

The contracts and protocol exist solely in the data. A client that understands a media type can then navigate the web of links based on knowledge gleaned from documents conforming to that media type. (This is exactly hypermedia as the engine of application state.)

First, Joe Gregorio's excellent post mentioning OpenSearch and offering advice on balancing 1 vs. n media types.

Second, Alan Dean posting this RDF shopping example.

Notice this paragraph in the shopping example:
To validate that the server supports the "Shop" protocol, the User Agent can test the document for the existence of the basket element, e.g. with the following XPath expression count(//shop:Basket)!=0
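
For the curious, that check could look something like this in Java with the standard javax.xml.xpath API. This is only a sketch: the namespace URI bound to the shop prefix is a placeholder I made up, and ShopProtocolCheck and supportsShop are hypothetical names, not part of the shopping example.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class ShopProtocolCheck {
    // Placeholder namespace URI (an assumption); use whatever the server's documents declare.
    static final String SHOP_NS = "http://example.org/shop#";

    // Returns true when the document contains at least one shop:Basket element.
    static boolean supportsShop(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required for prefixed XPath names to match
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(new NamespaceContext() {
                public String getNamespaceURI(String prefix) {
                    return "shop".equals(prefix) ? SHOP_NS : XMLConstants.NULL_NS_URI;
                }
                public String getPrefix(String uri) {
                    return SHOP_NS.equals(uri) ? "shop" : null;
                }
                public Iterator<String> getPrefixes(String uri) {
                    return SHOP_NS.equals(uri)
                            ? Collections.singletonList("shop").iterator()
                            : Collections.<String>emptyIterator();
                }
            });
            return (Boolean) xpath.evaluate("count(//shop:Basket)!=0", doc, XPathConstants.BOOLEAN);
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not inspect document", e);
        }
    }
}
```

The client never asks the server "do you implement the Shop interface?"; it just looks at the data it was handed.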

The shopping protocol is checked with a data expression!

My trivial example is just an instance of using data for protocol. I still think there is value in using signposts as a metaphor for self-descriptive data, if only to help lead people from programming APIs to hypermedia.

Friday, June 1, 2007

Microformats, formats, and schemas

Jacek has some true words for me in Tim Bray's Blog:
It seems to me that John, in his eagerness to use microformats, reinvents the idea of a standardized structured format. We had that with SGML, we have that with XML. You don't need to use XML Schema to work with XML. In fact, (and Tim may confirm or deny this), XML seems like it's intended to be usable for these signposts John speaks of - recognized names would be my interpretation. Granted, microformats have their advantages and disadvantages over XML.
Jacek, I do know these things. I've worked with SGML and XML for years at Isogen, and Eliot Kimber taught me many important things about markup, including SGML Architectures (which, interestingly, could provide an avenue to formalize microformats into fully defined markup languages). (Anyone have a link to SGML Architectural Forms?)

Thing is, I'm not really that interested in those details. Yes, they are important, but the distinctions between a DTD, RelaxNG, XML Schema, micro-format, and XLink don't make this issue any different.

On the topic of REST: only some commonly understood format is important. It's my personal opinion that any of the schemas I described in my previous post should be described in RelaxNG, XSD, and an HTML micro-format (with an SGML Architecture for good measure!).

My point is that multiple clients should be able to easily describe what must be followed and inspected, or what must be acted upon and triggered.

Jacek, I think we agree that IDL should not be dismissed. I'm trying to define a declarative definition that supports better machine processing and fits into the constraints of the Web.

The Web has one Interface. Trying to write an IDL against that one interface (URI, HTTP Methods) is either very boring or is trying to overlay domain methods onto HTTP methods.

I'm trying to find a way to describe web interfaces to machines. Hard coding URIs isn't good because of all the stuff people say. So, we add some special data (I'm not ready to say metadata) to some of the HTML forms to let a machine know which form should be used to buy something instead of logging out.