FME Evangelism #18: Spatial Data Validation, Nearest Neighbors, Oracle Connections

August 12th, 2008

Contents:
1) Spatial Data Validation…
2) …and the FME Server
3) NeighborFinder Update
4) Oracle Connections

Introduction
Hi FME’ers

Managing the quality of spatial data generally falls into two categories: data validation and data cleanup. Because FME is an ETL (Extract-Transform-Load) tool, we tend to concentrate on data validation only; data cleanup we see as more of an interactive process better handled in the source data’s native application.

But you can’t fix problems if you don’t know what they are or where they are located. So this post highlights a few recent updates on finding data problems with FME, and how to flag them. There’s an extremely useful update for the GeometryValidator transformer, plus a neat way of applying desktop validation models to the FME Server product.

Regards,

Spatial Data Validation…

Build 5584 of FME Desktop introduced a new feature to the GeometryValidator transformer. Previously it could tell you if a feature was valid or invalid (according to the OGC Simple Features standard), but now it also provides feedback on why features are invalid and where the problem occurs.

Below: This feature is obviously invalid and has been identified as such.

Below: Data Inspection tells us where and why the feature is invalid.

This new functionality really comes into its own when problems can’t be identified at a quick glance - for example reversed and duplicate segments in an area boundary.
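To give a feel for the kind of “why and where” feedback involved, here is a minimal Python sketch that looks for duplicate and reversed segments in an area boundary. It is an illustration only, not how the GeometryValidator works internally: the function name, the issue labels and the plain list-of-tuples ring are all assumptions made for this example.

```python
def find_repeated_segments(ring):
    """Report segments that occur more than once in an area boundary.

    `ring` is a list of (x, y) tuples. If the ring is not explicitly
    closed, the closing segment back to the first vertex is added.
    Returns a list of (issue, segment_index, segment) tuples, so each
    problem comes with a reason (why) and a location (where).
    The issue labels are hypothetical, not FME's own names.
    """
    pts = list(ring)
    if pts[0] != pts[-1]:
        pts.append(pts[0])

    first_seen = {}  # undirected segment -> directed segment first seen
    issues = []
    for i in range(len(pts) - 1):
        a, b = pts[i], pts[i + 1]
        if a == b:
            issues.append(("zero_length_segment", i, (a, b)))
            continue
        key = frozenset((a, b))
        if key not in first_seen:
            first_seen[key] = (a, b)
        elif first_seen[key] == (a, b):
            issues.append(("duplicate_segment", i, (a, b)))
        else:
            issues.append(("reversed_segment", i, (a, b)))
    return issues


# A square whose right-hand edge is traced up, back down, then up again.
boundary = [(0, 0), (4, 0), (4, 4), (4, 0), (4, 4), (0, 4)]
for issue, index, segment in find_repeated_segments(boundary):
    print(f"segment {index}: {issue} at {segment}")
```

Problems like these are almost impossible to spot by eye because the repeated segments sit exactly on top of each other, which is why a reported reason and location is so useful.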

Also, because the list of known issues is stable (and described in the documentation) it’s simple to add an AttributeFilter/Fanout to subdivide each failed feature into its own workflow/output.
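In plain Python terms, that filter/fanout step is just grouping failed features by the reported issue. The sketch below assumes each failed feature is a dictionary carrying an `issue` attribute; that attribute name and the issue values shown are made up for illustration, so check the GeometryValidator documentation for the real list.

```python
from collections import defaultdict


def fan_out_by_issue(failed_features, issue_attr="issue"):
    """Group failed features into one output list per issue value.

    `issue_attr` is a hypothetical attribute name; features without it
    are collected under "unknown" rather than being dropped.
    """
    outputs = defaultdict(list)
    for feature in failed_features:
        outputs[feature.get(issue_attr, "unknown")].append(feature)
    return dict(outputs)


# Made-up features and issue names, purely for illustration.
failed = [
    {"id": 1, "issue": "self_intersection"},
    {"id": 2, "issue": "duplicate_segment"},
    {"id": 3, "issue": "self_intersection"},
    {"id": 4},
]
for issue, features in fan_out_by_issue(failed).items():
    print(issue, [f["id"] for f in features])
```

Because the set of issue values is stable, each group can safely be wired to its own downstream handling.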

…and the FME Server

A common use for spatial data validation is vetting data submissions; for example contractors submitting data to an employer, or companies submitting data to a government organization.

In either case the receiver is usually obliged to carry out data validation and return problem features to the originator for fixing.

Of course, the ideal process would be for the submitter to upload data online and get instant feedback on data quality. That way the receiver doesn’t have to carry out basic QA tests.

But that’s a huge operation to put in place, right? Wrong! Not if you have FME Server.

See this example on fmepedia. A workspace is created to validate a set of data, and write the results to a text file using basic HTML syntax. That workspace is uploaded to an FME Server and registered as a Data Streaming service.

And that’s it! Click the workspace URL, select the layer to process, and it opens up a web page report on the data:

The demo only allows you to select a layer to test. Converting this to select a dataset to upload and test is as simple as publishing the source dataset parameter. When you do that FME Server automatically prompts for the dataset and provides an upload option.

How do the results open automatically in a browser? With a new Textfile writer setting called Mime Type. Set it to Text/HTML when adding a Textfile writer and tag it as an FME Server streaming service - when run the output opens automatically in a web browser!
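The report itself is nothing more than text with HTML tags in it. As a rough sketch of what the workspace writes out (outside FME, and with a made-up function name, layer name and problem list), it might look like this in Python:

```python
from html import escape


def build_html_report(layer, problems):
    """Return a basic HTML page listing validation problems.

    `problems` is a list of (feature_id, message) pairs. Values are
    escaped so that characters such as < and & in attribute values do
    not break the page.
    """
    rows = "\n".join(
        f"<tr><td>{escape(str(fid))}</td><td>{escape(msg)}</td></tr>"
        for fid, msg in problems
    )
    return (
        "<html><body>\n"
        f"<h1>Validation report: {escape(layer)}</h1>\n"
        f"<p>{len(problems)} problem(s) found.</p>\n"
        '<table border="1">\n'
        "<tr><th>Feature</th><th>Problem</th></tr>\n"
        f"{rows}\n"
        "</table>\n"
        "</body></html>"
    )


# Hypothetical layer and problems, purely for illustration.
print(build_html_report("roads", [(17, "speed < 0"), (42, "name is empty")]))
```

Served as plain text, a browser would show the tags literally; it is the text/html MIME type that tells the browser to render the same content as a page.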

Admittedly this example only checks attribute values - it does no geometry validation. But combine this with the GeometryValidator example above, and you get a very powerful and useful validation tool.

As with the taxi demo from a few issues ago, you may have already seen something similar to this technique, but (once more) remember:

  • This is FME! Simple and quick. If you already use FME to validate data you just need to add a Textline writer to your workspace and upload it to FME Server. If you don’t already use FME, a basic validation workspace would take only minutes to create.
  • This is Spatial ETL! The demo is just a starting point. Now you can add more source data to carry out spatial validations - for example, is the submission within x metres of a right-of-way? - or you could even fix basic errors without having to return the data to the submitter!
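As an illustration of the “within x metres of a right-of-way” style of test, here is a small Python sketch of the underlying geometry. It assumes planar coordinates in metres, a submission reduced to a single point and a right-of-way given as a simple list of vertices; in a workspace you would use FME’s own transformers rather than anything like this.

```python
import math


def point_segment_distance(p, a, b):
    """Shortest planar distance from point p to the segment a-b."""
    px, py = p
    ax, ay = a
    bx, by = b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0:
        # Degenerate segment: a and b are the same point.
        return math.hypot(px - ax, py - ay)
    # Project p onto the line through a and b, clamped to the segment.
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def within_distance(point, line, max_dist):
    """True if point lies within max_dist of any segment of line."""
    return any(
        point_segment_distance(point, a, b) <= max_dist
        for a, b in zip(line, line[1:])
    )


# Hypothetical right-of-way centreline and submitted points.
right_of_way = [(0, 0), (100, 0), (100, 50)]
print(within_distance((50, 3), right_of_way, 5))   # 3 m from the line
print(within_distance((50, 8), right_of_way, 5))   # 8 m from the line
```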

NeighborFinder Update

The NeighborFinder transformer finds the nearest neighbor (candidate) for each original (base) feature.

There is a new example on fmepedia that demonstrates use of this transformer, but also highlights a new feature for FME2009: The Candidates First option.

Some users have mentioned that the NeighborFinder can be slow with large amounts of data, and the Candidates First option is our way of addressing that issue. It’s very similar to the Clipper transformer’s Clippers First option, and works like this:

  • To be sure of finding the nearest neighbor for a particular base feature, FME must have the full set of candidates available. Therefore as features arrive at the transformer, they are all cached to disk. When the flow of features is exhausted - i.e. FME is sure it has all base and candidate features - processing can take place.
  • However, FME doesn’t need the base features all at once. It only caches these because it can’t be sure there aren’t more candidates to come.
  • So, if we can force candidate features to arrive at the transformer first, and inform the transformer that this is the case, each base feature can be processed immediately, without checking to see if there are further candidates to come.

The “Candidates First” option is how we inform the transformer that the first features to arrive are candidates. The benefit is improved performance: the process is usually faster and uses less memory.
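If it helps to see the idea as code, here is a toy Python sketch of the two modes. It is only a model of the behaviour described above, not the transformer’s implementation: features are plain dictionaries with made-up `role` and `xy` keys, and the nearest neighbor is found by brute force.

```python
import math


def nearest(base, candidates):
    """Brute-force nearest candidate to a base feature."""
    return min(candidates, key=lambda c: math.dist(base["xy"], c["xy"]))


def neighbor_finder(features, candidates_first=False):
    """Yield (base, nearest_candidate) pairs from one feature stream.

    With candidates_first=False every base feature is held back until
    the stream ends, because more candidates might still arrive.
    With candidates_first=True the caller promises that all candidates
    arrive before any base feature, so each base is matched and released
    immediately and never needs to be held.
    """
    candidates, held_bases = [], []
    for feature in features:
        if feature["role"] == "candidate":
            candidates.append(feature)
        elif candidates_first:
            yield feature, nearest(feature, candidates)
        else:
            held_bases.append(feature)
    for feature in held_bases:
        yield feature, nearest(feature, candidates)


# Hypothetical data: candidates are listed before the base features.
stream = [
    {"role": "candidate", "id": "c1", "xy": (0, 0)},
    {"role": "candidate", "id": "c2", "xy": (10, 0)},
    {"role": "base", "id": "b1", "xy": (1, 1)},
    {"role": "base", "id": "b2", "xy": (9, 1)},
]
for base, cand in neighbor_finder(stream, candidates_first=True):
    print(base["id"], "->", cand["id"])
```

Both modes give the same matches when the candidates really do come first; the difference is that the `held_bases` list stays empty, which is where the memory saving comes from. If the promise is broken and a base feature arrives before any candidate, this sketch fails, which is why the ordering matters.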

Below: With the Candidates First option turned off this translation took 13.3 seconds and used 96308 kB of memory at peak.

2008-08-07 12:27:31|  13.3|  0.0|INFORM|Translation was SUCCESSFUL with 0 warning(s) (0 feature(s)/0 coordinate(s) output)
2008-08-07 12:27:31|  13.3|  0.0|INFORM|FME Session Duration: 13.2 seconds. (CPU: 12.0s user, 0.5s system)
2008-08-07 12:27:31|  13.3|  0.0|INFORM|END - ProcessID: 4368, peak process memory usage: 96308 kB, current process memory usage: 58552 kB.

Below: With the Candidates First option turned on this translation took 12.7 seconds and used 56208 kB of memory at peak.

2008-08-07 12:29:12|  12.7|  0.0|INFORM|Translation was SUCCESSFUL with 0 warning(s) (0 feature(s)/0 coordinate(s) output)
2008-08-07 12:29:12|  12.7|  0.0|INFORM|FME Session Duration: 12.8 seconds. (CPU: 11.5s user, 0.5s system)
2008-08-07 12:29:12|  12.7|  0.0|INFORM|END - ProcessID: 6940, peak process memory usage: 56208 kB, current process memory usage: 48244 kB.

So this option can make the transformer work faster and more efficiently. The above example is only a small set of data. Given millions of features this setting would be a real benefit and maybe the difference between a successful and failed translation.
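To put numbers on that, the peak memory figure can be pulled straight out of the final log line of each run. The Python below is just a convenience sketch: the regular expression is written against the two log lines shown above and is not an official description of the FME log format.

```python
import re

# Matches the peak memory figure in the END line of the logs above.
PEAK_RE = re.compile(r"peak process memory usage: (\d+) kB")


def peak_memory_kb(log_text):
    """Return the peak process memory (in kB) reported in a log."""
    match = PEAK_RE.search(log_text)
    if match is None:
        raise ValueError("no peak memory figure found in log text")
    return int(match.group(1))


log_off = (
    "2008-08-07 12:27:31|  13.3|  0.0|INFORM|END - ProcessID: 4368, "
    "peak process memory usage: 96308 kB, "
    "current process memory usage: 58552 kB."
)
log_on = (
    "2008-08-07 12:29:12|  12.7|  0.0|INFORM|END - ProcessID: 6940, "
    "peak process memory usage: 56208 kB, "
    "current process memory usage: 48244 kB."
)

off_kb = peak_memory_kb(log_off)
on_kb = peak_memory_kb(log_on)
memory_saving = (off_kb - on_kb) / off_kb * 100
time_saving = (13.3 - 12.7) / 13.3 * 100

print(f"Peak memory reduced by {memory_saving:.1f}%")
print(f"Elapsed time reduced by {time_saving:.1f}%")
```

For this small dataset that works out to roughly 42% less peak memory for about a 4.5% saving in time, so the memory benefit is the more striking of the two.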

But how do we ensure that candidates are indeed first? There are a number of ways, the most common being to re-order the readers (shown below) in the navigation pane (other methods being a Sorter transformer, or drawing workspace connections in a certain order).

Does that all make sense? If not, the best analogy I can think of is interviewing job candidates. If you place a job ad saying “be here between 2pm and 5pm” then you will have to wait until 5pm because you’ll never know if there are any more candidates to come. But if you say “be here at 2pm” you’ll know that your work is complete when the last candidate is done, and you can go home early.

Don’t worry if you don’t understand the technicalities: just be aware of the possibility for using that setting, follow the example, and all should work out fine.

Did You Know? Oracle Connections

When FME attempts to connect to Oracle, it uses a small application called ora8list. If the connection fails you may get a message along the lines of:

The below command line failed to execute:
ora8list -allusers -object "<tablename>" "<database>" "<user>" C:\Temp\fme9999.tmp


This message doesn’t do much to explain the reason behind the connection failure (“infamously unhelpful” is how I’ve heard it described). However, at the same time FME also writes to a file called sqlnet.txt which is found in the <FME_HOME> directory.

This file includes the error messages which Oracle is returning on the failed connection. From these you can determine the ORA-xxxxx error number and then (hopefully) resolve the Oracle issue.
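If the file is long, a few lines of Python will fish the error numbers out for you. This is only a sketch: it assumes the codes appear in the usual ORA-nnnnn form with five digits, and the sample text below is made up rather than copied from a real sqlnet.txt.

```python
import re

# Oracle error codes in the usual five-digit form, e.g. ORA-12154.
ORA_RE = re.compile(r"\bORA-\d{5}\b")


def extract_ora_codes(log_text):
    """Return the distinct ORA-nnnnn codes in order of first appearance."""
    codes = []
    for code in ORA_RE.findall(log_text):
        if code not in codes:
            codes.append(code)
    return codes


# Made-up sample text; in practice read the contents of sqlnet.txt from
# your own <FME_HOME> directory and pass them in instead.
sample = """\
connection attempt failed
ORA-12154: could not resolve the connect identifier
retrying
ORA-12154: could not resolve the connect identifier
"""
print(extract_ora_codes(sample))
```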

Thanks to Safe’s Robyn Rennie for that useful tip.

This Edition of the FME Evangelist…
…was written to the tune of Northern Sky by Nick Drake.

Pro Services colleague Mita was given a copy of his first album, Five Leaves Left, and I was horrified to find she hadn’t heard of him or listened to the album yet. So for anyone else in the same position, check out his music. I bet you’ll have heard some of it played in films or commercials.

The video is a bit cheesy since there is no live footage of him (his was the archetypal “tragic early death”) but his music and guitar playing are incredible: unusual tunings and cluster chords are not the half of it - let’s see you try that on Guitar Hero!

Other recommended songs are Place to Be and Pink Moon.

Entry Filed under: Data Transformation, Database, FME Desktop, FME Server, GIS, Geoweb
