An English-only Planet Iron Man

Seeing all the non-English Perl blogs aggregated on Planet Iron Man, I'm very happy to know that Perl has global appeal. But since I'm a (typical American) monoglot, I'd prefer an Iron Man feed with only English articles. So I made one.

It's available at It updates hourly from the master feed.

And for the curious, or for anyone who wants to adapt this for other languages, here's the Perl program that I whipped up to create the feed:

Update: I've also put the code up on GitHub: ironman-feedfilter

# - downloads and filters the Perl Ironman feed for English
# entries. Results sent to STDOUT.
# The heuristic filters out entries unless the content is mostly latin
# characters and English is close to the best guess of a language.  Short
# entries with code seem to confuse Lingua::Identify, so we take entries that
# seem "close-enough".  Tuned via trial-and-error.
# Copyright (c) 2010 by David Golden - This may be used or copied under the
# same terms as Perl itself.

use 5.008001;
use strict;
use warnings;
use utf8;
use autodie;

use IO::File;
use Lingua::Identify qw(:language_identification);
use Time::Piece;
use URI;

use XML::Atom::Feed;
$XML::Atom::ForceUnicode = 1;
$XML::Atom::DefaultVersion = "1.0";

# Global heuristic tuning
my $latin_target = 0.95;  # 95% latin chars
my $lang_fuzz = 0.02;     # English within 2% probability of best language



sub latin_ratio {
  my $string = shift;
  my $alpha =()= $string =~ /(\p{Alphabetic})/g;
  my $latin =()= $string =~ /(\p{Latin})/g;
  return 0 if ! $latin || !$alpha; # !$alpha probably redundant
  return $latin / $alpha;
}

sub run {
  my $in_feed = XML::Atom::Feed->new(URI->new(""));

  my $out_feed = XML::Atom::Feed->new;
  $out_feed->title("Planet Iron Man: English Edition");
  $out_feed->subtitle( $in_feed->subtitle );
  $out_feed->generator("XML::Atom/" . XML::Atom->VERSION);
  $out_feed->updated( gmtime->datetime . "Z" );
  # Carry the source feed's links over to the filtered feed
  for my $l ( $in_feed->link ) {
    $out_feed->add_link($l);
  }

  for my $e ( $in_feed->entries ) {
    my $content = $e->content->body;
    my $latin = latin_ratio($content);
    my %lang = langof($content);
    my $best = [sort { $lang{$b} <=> $lang{$a} } keys %lang]->[0];
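    $lang{en} ||= 0;
    # Keep the entry only if it is mostly Latin script and English scores
    # within $lang_fuzz of the best-guess language
    $out_feed->add_entry($e)
      if $latin > $latin_target && ($lang{$best} - $lang{en} < $lang_fuzz);
  }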

  binmode(STDOUT, ":utf8");
  print $out_feed->as_xml;
}

run();
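
For anyone tuning the thresholds for another language, it may help to try the keep/reject decision on its own, without fetching a feed. The following is only a minimal sketch: `keep_entry` is a hypothetical helper that is not part of the program above, and the scores passed to it are made-up numbers shaped like the language/confidence pairs that `langof` returns in list context.

```perl
use strict;
use warnings;
use List::Util qw(max);

# Same tuning values as the program above
my $latin_target = 0.95;  # 95% latin chars
my $lang_fuzz    = 0.02;  # English within 2% probability of best language

# Hypothetical helper (an assumption for illustration): takes a Latin
# ratio plus language => confidence pairs and returns 1 to keep, 0 to drop.
sub keep_entry {
  my ($latin, %lang) = @_;
  my $en   = $lang{en} || 0;
  my $best = max(0, values %lang);  # 0 if no language was guessed at all
  return ($latin > $latin_target && $best - $en < $lang_fuzz) ? 1 : 0;
}

# Made-up scores, not real Lingua::Identify output
print keep_entry(0.99, en => 0.30, fr => 0.31), "\n";  # English nearly tied for best: prints 1
print keep_entry(0.99, pt => 0.50, en => 0.20), "\n";  # another language clearly ahead: prints 0
print keep_entry(0.40, en => 0.90), "\n";              # mostly non-Latin script: prints 0
```

To target a different language, the `en` key would presumably be swapped for that language's code, and both thresholds would likely need re-tuning by trial and error.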


  1. Posted February 6, 2010 at 7:28 am | Permalink

    You'd better use Google Translate to process non-English posts and re-publish them in your personal feed.

  2. Posted February 6, 2010 at 9:19 am | Permalink

    The non-internationaling is that which will may kill the Perl. (:

  3. Zukoff
    Posted February 6, 2010 at 2:54 pm | Permalink

    Bad idea. I also read Japanese posts; everything is interesting. Better to make auto-translations for those.

  4. Posted February 7, 2010 at 3:43 am | Permalink

    Awesome! I think this would be a nice feature to be built into the next-gen Catalyst-based version of the Iron Man stuff, but in the meantime this is great!

  5. Posted February 7, 2010 at 4:51 am | Permalink

    TIMTOWTDI., Naim, Zukoff: He's given you the code... Perhaps you can pick it up and make a Google Translate version?

    fREW: is there a repository where the Iron Man stuff is being worked on?

  6. Posted February 8, 2010 at 9:17 pm | Permalink

    No, this is not a bad idea. It is a great idea and I applaud you for creating it. As much as I love to bring the community together, I think that there are a lot of posts there that a lot of people can't even read, and some sort of boundaries should indeed be put in place.

    Besides English posts, I like to (and can) read the ones in Spanish, but that's pretty much it. I basically skip immediately or don't even bother to look at what people write in Russian or Japanese (or whatever), just because I can't understand a single word they say. And unfortunately, automatic translators have always sucked ass, so I wouldn't even bother reading the translation.

  7. Posted February 28, 2010 at 11:01 am | Permalink

    As a fellow monoglot myself, I applaud and appreciate this effort. I unsubed from the regular feed and subbed to this instead!

    Now if we can only do something about the items that come across as raw HTML and look terrible in a feed reader.

  8. Posted March 1, 2010 at 1:28 am | Permalink

    yeah it seems all the blogger blogs are being spit out in raw html... no idea why.

    • dagolden
      Posted March 1, 2010 at 5:16 am | Permalink

      If I remember correctly, a bunch of those have a proper 'summary' entry, but the 'content' entry comes out raw. I didn't have the time to do more heuristics to flip them around and clean them up -- and I'd rather see that fixed upstream by the Iron Man feed anyway.
