Seeing all the non-English Perl blogs aggregated on Planet Iron Man, I'm very happy to know that Perl has global appeal. But since I'm a (typical American) monoglot, I'd prefer an Iron Man feed with only English articles. So I made one.
It's available at http://feeds.dagolden.com/ironman-english.xml. It updates hourly from the master feed.
And for the curious, or for anyone who wants to adapt this for other languages, here's the Perl program that I whipped up to create the feed:
Update: I've also put the code up on GitHub: ironman-feedfilter
# feedfilter.pl - downloads and filters the Perl Ironman feed for English
# entries. Results sent to STDOUT.
#
# The heuristic filters out entries unless the content is mostly latin
# characters and English is close to the best guess of a language. Short
# entries with code seem to confuse Lingua::Identify, so we take entries that
# seem "close-enough". Tuned via trial-and-error.
#
# Copyright (c) 2010 by David Golden - This may be used or copied under the
# same terms as Perl itself.
use 5.008001;
use strict;
use warnings;
use utf8;
use autodie;
use IO::File;
use Lingua::Identify qw(:language_identification);
use Time::Piece;
use URI;
use XML::Atom::Feed;
$XML::Atom::ForceUnicode = 1;
$XML::Atom::DefaultVersion = "1.0";
# Global heuristic tuning
my $latin_target = 0.95; # 95% latin chars
my $lang_fuzz = 0.02; # English within 2% probability of best language
run();
#--------------------------------------------------------------------------#
sub latin_ratio {
    my $string = shift;
    my $alpha =()= $string =~ /(\p{Alphabetic})/g;
    my $latin =()= $string =~ /(\p{Latin})/g;
    return 0 if ! $latin || !$alpha; # !$alpha probably redundant
    return $latin / $alpha;
}
sub run {
    my $in_feed = XML::Atom::Feed->new(URI->new("http://ironman.enlightenedperl.org"));
    my $out_feed = XML::Atom::Feed->new;
    $out_feed->title("Planet Iron Man: English Edition");
    $out_feed->subtitle( $in_feed->subtitle );
    $out_feed->id("tag:feeds.dagolden.com,".gmtime->year().":ironman:english");
    $out_feed->generator("XML::Atom/" . XML::Atom->VERSION);
    $out_feed->updated( gmtime->datetime . "Z" );
    for my $l ( $in_feed->link ) {
        $out_feed->link($l);
    }
    for my $e ( $in_feed->entries ) {
        # skip entries with no content element; calling ->body on them would die
        my $content_obj = $e->content or next;
        my $content = $content_obj->body;
        next unless defined $content;
        my $latin = latin_ratio($content);
        my %lang = langof($content);
        my $best = [sort { $lang{$b} <=> $lang{$a} } keys %lang]->[0];
        $lang{en} ||= 0;
        $out_feed->add_entry($e)
            if $latin > $latin_target && ($lang{$best} - $lang{en} < $lang_fuzz);
    }
    binmode(STDOUT, ":utf8");
    print $out_feed->as_xml;
}
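For anyone adapting this to another language, the decision rule can be pulled out into a function that takes the target language as a parameter. What follows is only a rough sketch, not part of the script above: `keep_entry` is a hypothetical helper, and it assumes the scores arrive as a hash of language code to probability, the shape that `langof` returns in list context. It uses only core Perl so the rule can be tried without fetching a feed.

```perl
use strict;
use warnings;

# Fraction of alphabetic characters that belong to the Latin script.
sub latin_ratio {
    my ($string) = @_;
    my $alpha = () = $string =~ /\p{Alphabetic}/g;
    my $latin = () = $string =~ /\p{Latin}/g;
    return 0 unless $alpha && $latin;
    return $latin / $alpha;
}

# Hypothetical helper: keep an entry when its text is mostly Latin script
# and the target language scores within $lang_fuzz of the best guess.
# $scores is a hash ref of language code => probability.
sub keep_entry {
    my ( $text, $scores, $target, $latin_target, $lang_fuzz ) = @_;
    my ($best) = sort { $scores->{$b} <=> $scores->{$a} } keys %$scores;
    return 0 unless defined $best;    # sketch's choice: no guess, no entry
    return 0 unless latin_ratio($text) > $latin_target;
    my $target_score = $scores->{$target} || 0;
    return ( $scores->{$best} - $target_score < $lang_fuzz ) ? 1 : 0;
}

# Made-up scores for illustration only.
my %scores = ( es => 0.31, pt => 0.30, en => 0.05 );
for my $target (qw(pt en)) {
    my $keep = keep_entry( "Hola mundo desde Perl", \%scores, $target, 0.95, 0.02 );
    print "$target: ", ( $keep ? "keep" : "skip" ), "\n";
}
```

With these made-up scores, Portuguese is close enough to the best guess to be kept, while English is not. The Latin-script test only makes sense for languages written in Latin script; for Russian or Japanese it would need a different script property, and the thresholds would probably need retuning by trial and error.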
9 Comments
You'd better use Google Translate to process non-English posts and re-publish them in your personal feed.
The non-internationaling is that which will may kill the Perl. (:
Bad idea. I also read Japanese posts; everything is interesting. Better to make auto-translation for those.
Awesome! I think this would be a nice feature to be built into the next-gen Catalyst-based version of the Iron Man stuff, but in the meantime this is great!
TIMTOWTDI.
andy.sh, Naim, Zukoff: He's given you the code... Perhaps you can pick it up and make a Google translations version?
fREW: is there a repository in which the Iron Man stuff is being worked on?
No, this is not a bad idea. It is a great idea and I applaud you for creating it. As much as I love to bring the community together, I think that there are a lot of posts there that a lot of people can't even read, and some sort of boundaries should indeed be put in place.
Besides English posts, I like to (and can) read the ones in Spanish, but that's pretty much it. I basically skip immediately or don't even bother to look at what people write in Russian or Japanese (or whatever), just because I can't understand a single word they say. And unfortunately, automatic translators have always sucked ass, so I wouldn't even bother reading the translation.
As a fellow monoglot myself, I applaud and appreciate this effort. I unsubed from the regular feed and subbed to this instead!
Now if we can only do something about the items that come across as raw HTML and look terrible in a feed reader.
Yeah, it seems all the Blogger blogs are being spit out as raw HTML... no idea why.
If I remember correctly, a bunch of those have a proper 'summary' entry, but the 'content' entry comes out raw. I didn't have the time to do more heuristics to flip them around and clean them up -- and I'd rather see that fixed upstream by the Iron Man feed anyway.
2 Trackbacks
[...] Language tagging of posts to enable filtering and possible translation (See comments here and also here) [...]
[...] has provided a method for getting English-only Ironman feeds. I like seeing lots of different languages in my browser window when I go to the Ironman site [...]