An English-only Planet Iron Man

Seeing all the non-English Perl blogs aggregated on Planet Iron Man, I'm very happy to know that Perl has global appeal. But since I'm a (typical American) monoglot, I'd prefer an Iron Man feed with only English articles. So I made one.

It's available at http://feeds.dagolden.com/ironman-english.xml. It updates hourly from the master feed.

Update: I've also put the code up on GitHub: ironman-feedfilter

And for the curious, or for anyone who wants to adapt this for other languages, here's the Perl program that I whipped up to create the feed:

# feedfilter.pl - downloads and filters the Perl Ironman feed for English
# entries. Results sent to STDOUT.
#
# The heuristic filters out entries unless the content is mostly Latin
# characters and English is close to the best guess of a language.  Short
# entries with code seem to confuse Lingua::Identify, so we take entries that
# seem "close-enough".  Tuned via trial-and-error.
#
# Copyright (c) 2010 by David Golden - This may be used or copied under the
# same terms as Perl itself.

use 5.008001;
use strict;
use warnings;
use utf8;
use autodie;

use IO::File;
use Lingua::Identify qw(:language_identification);
use Time::Piece;
use URI;

use XML::Atom::Feed;
$XML::Atom::ForceUnicode = 1;
$XML::Atom::DefaultVersion = "1.0";

# Global heuristic tuning
my $latin_target = 0.95;  # 95% Latin chars
my $lang_fuzz = 0.02;     # English within 2% probability of best language

run();

#--------------------------------------------------------------------------#

sub latin_ratio {
  my $string = shift;
  my $alpha =()= $string =~ /(\p{Alphabetic})/g;
  my $latin =()= $string =~ /(\p{Latin})/g;
  
  return 0 if ! $latin || !$alpha; # !$alpha probably redundant
  return $latin / $alpha;
}

sub run {
  my $in_feed = XML::Atom::Feed->new(URI->new("http://ironman.enlightenedperl.org"));

  my $out_feed = XML::Atom::Feed->new;
  $out_feed->title("Planet Iron Man: English Edition");
  $out_feed->subtitle( $in_feed->subtitle );
  $out_feed->id("tag:feeds.dagolden.com,".gmtime->year().":ironman:english");
  $out_feed->generator("XML::Atom/" . XML::Atom->VERSION);
  $out_feed->updated( gmtime->datetime . "Z" );
  for my $l ( $in_feed->link ) {
    $out_feed->add_link($l);  # append each link; link() as a setter replaces
  }

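  for my $e ( $in_feed->entries ) {
    my $content_elem = $e->content or next;  # skip entries with no content element
    my $content = $content_elem->body;
    next unless defined $content && length $content;
    my $latin = latin_ratio($content);
    my %lang = langof($content);             # language code => probability
    next unless %lang;                       # no language guess at all
    my ($best) = sort { $lang{$b} <=> $lang{$a} } keys %lang;
    $lang{en} ||= 0;
    $out_feed->add_entry($e)
      if $latin > $latin_target && ($lang{$best} - $lang{en} < $lang_fuzz);
  }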

  binmode(STDOUT, ":utf8");
  print $out_feed->as_xml;
}
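
For anyone adapting the filter to another language, here is a minimal sketch of the acceptance test from run(), pulled out into a standalone function so the target language becomes a parameter. The helper name accept_entry and its argument list are assumptions for illustration and are not part of the original program. The scores in the usage lines are made-up numbers, not real Lingua::Identify output. The sketch uses only core Perl, so it runs without fetching the feed.

```perl
use strict;
use warnings;
use utf8;    # the sample strings below contain non-ASCII literals

# Same ratio as latin_ratio() in the program above: Latin-script letters
# divided by all alphabetic characters.
sub latin_ratio {
    my $string = shift;
    my $alpha = () = $string =~ /\p{Alphabetic}/g;
    my $latin = () = $string =~ /\p{Latin}/g;
    return 0 if !$alpha;
    return $latin / $alpha;
}

# Hypothetical helper (not in the original program): the acceptance test
# from run(), with the target language passed in instead of hard-coded.
# $scores is a hash ref of language code => probability, the same shape
# that run() stores in %lang.
sub accept_entry {
    my ( $text, $scores, $target, $latin_target, $lang_fuzz ) = @_;
    return 0 unless %$scores;
    my ($best) = sort { $scores->{$b} <=> $scores->{$a} } keys %$scores;
    my $target_score = $scores->{$target} || 0;
    my $ok = latin_ratio($text) > $latin_target
        && $scores->{$best} - $target_score < $lang_fuzz;
    return $ok ? 1 : 0;
}

printf "%.2f\n", latin_ratio("Perl is fun");        # prints 1.00
printf "%.2f\n", latin_ratio("Perl это весело");    # prints 0.31 (4 of 13 letters)

# Made-up scores for illustration only.
my %scores = ( en => 0.30, pt => 0.31 );
print accept_entry( "A short post about Perl", \%scores, "en", 0.95, 0.02 ), "\n";  # prints 1
```

The Latin-script check only makes sense for target languages written in the Latin alphabet (Spanish or German, say). For a language such as Russian or Japanese, the script property in latin_ratio would need to change as well, and both thresholds would probably need retuning.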

9 Comments

  1. andy.sh
    Posted February 6, 2010 at 7:28 am

    You'd better use Google translate to process non-English posts and re-publish them in your personal feed.

  2. Posted February 6, 2010 at 9:19 am

    The non-internationaling is that which will may kill the Perl. (:

  3. Zukoff
    Posted February 6, 2010 at 2:54 pm

    Bad idea. I also read Japanese posts; everything is interesting. Better to make auto-translation for those.

  4. Posted February 7, 2010 at 3:43 am

    Awesome! I think this would be a nice feature to be built into the next-gen Catalyst-based version of the Iron Man stuff, but in the meantime this is great!

  5. Posted February 7, 2010 at 4:51 am

    TIMTOWTDI.

    andy.sh, Naim, Zukoff: He's given you the code... Perhaps you can pick it up and make a Google Translate version?

    fREW: is there a repository in which the Iron Man stuff is being worked on?

  6. Posted February 8, 2010 at 9:17 pm

    No, this is not a bad idea. It is a great idea and I applaud you for creating it. As much as I love bringing the community together, I think there are a lot of posts there that many people can't even read, and some sort of boundaries should indeed be put in place.

    Besides English posts, I like to (and can) read the ones in Spanish, but that's pretty much it. I basically skip immediately, or don't even bother to look at, what people write in Russian or Japanese (or whatever), just because I can't understand a single word they say. And unfortunately, automatic translators have always sucked ass, so I wouldn't even bother reading the translation.

  7. Posted February 28, 2010 at 11:01 am

    As a fellow monoglot, I applaud and appreciate this effort. I unsubbed from the regular feed and subbed to this instead!

    Now if we can only do something about the items that come across as raw HTML and look terrible in a feed reader.

  8. Posted March 1, 2010 at 1:28 am

    Yeah, it seems all the Blogger blogs are being spit out in raw HTML... no idea why.

    • dagolden
      Posted March 1, 2010 at 5:16 am

      If I remember correctly, a bunch of those have a proper 'summary' entry, but the 'content' entry comes out raw. I didn't have the time to do more heuristics to flip them around and clean them up -- and I'd rather see that fixed upstream by the Iron Man feed anyway.

© 2009-2014 David Golden All Rights Reserved