
2012-07-24 :-(

_ Morning

0550 Woke up

0800 Left for work

0900 Investigation

_ Afternoon

1300 Investigation

1630 Left work

_ Evening

1830 pkgsrc tinkering

1930 RR7

2100 Dinner. Goya champuru

_ MeCab word segmentation (wakati-gaki)

> mecab.exe -O in.txt
writer.cpp(63) [!tmp.empty()] unkown format type [in.txt]

unknown ?

Hmm?

Let's look at writer.cpp.

bool Writer::open(const Param &param) {
  const std::string ostyle = param.get<std::string>("output-format-type");
  write_ = &Writer::writeLattice;

  if (ostyle == "wakati") {
    write_ = &Writer::writeWakati;
  } else if (ostyle == "none") {
    write_ = &Writer::writeNone;
  } else if (ostyle == "dump") {
    write_ = &Writer::writeDump;
  } else if (ostyle == "em") {
    write_ = &Writer::writeEM;
  } else {
    // default values
    std::string node_format = "%m\\t%H\\n";
    std::string unk_format  = "%m\\t%H\\n";
    std::string bos_format  = "";
    std::string eos_format  = "EOS\\n";
    std::string eon_format  = "";

    std::string node_format_key = "node-format";
    std::string bos_format_key  = "bos-format";
    std::string eos_format_key  = "eos-format";
    std::string unk_format_key  = "unk-format";
    std::string eon_format_key  = "eon-format";

    if (!ostyle.empty()) {
      node_format_key += "-";
      node_format_key += ostyle;
      bos_format_key += "-";
      bos_format_key += ostyle;
      eos_format_key += "-";
      eos_format_key += ostyle;
      unk_format_key += "-";
      unk_format_key += ostyle;
      eon_format_key += "-";
      eon_format_key += ostyle;
      const std::string tmp = param.get<std::string>(node_format_key.c_str());
      CHECK_FALSE(!tmp.empty()) << "unkown format type [" << ostyle << "]";         // <- this is writer.cpp(63)
    }

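Ah, I see. -O takes the output format name as its argument, so in.txt was being read as the format type.

The correct invocation is: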

> mecab.exe -O wakati in.txt
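
The lookup in the writer.cpp excerpt above can be sketched in Ruby as follows. This is only an illustration of the branch structure shown in the excerpt, not MeCab's actual API: the method name resolve_output_format, the params Hash and the returned symbols are all made up for this sketch.

```ruby
# Built-in writers that the excerpt selects directly by name.
BUILTIN_FORMATS = %w[wakati none dump em].freeze

# Return a symbol describing which writer would be chosen for ostyle.
# params is a plain Hash standing in for MeCab's configuration (hypothetical).
def resolve_output_format(ostyle, params = {})
  return ostyle.to_sym if BUILTIN_FORMATS.include?(ostyle)
  return :default if ostyle.empty?   # no -O given: default node/unk/eos formats

  # Anything else must be a user-defined format, which requires a
  # "node-format-<ostyle>" entry in the configuration.
  key = "node-format-#{ostyle}"
  if params.fetch(key, "").empty?
    raise ArgumentError, "unknown format type [#{ostyle}]"
  end
  :user_defined
end
```

With this sketch, resolve_output_format("wakati") picks the wakati writer, while resolve_output_format("in.txt") raises, which corresponds to the error seen with "-O in.txt".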

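_ An older article is criticizing a newer one

Practical IE: Quality Control from the Shop-Floor Perspective (10): The Main Statistical Methods Used in Quality Control, "Cause-and-Effect Diagram" (1/2) - ＠IT MONOist, 2012-01-31

When creating a cause-and-effect diagram, you must include many third parties, have as wide a range of people as possible take part, and use brainstorming (the BS method) to list as many potentially influential causes (factors) as you can and organize them into the diagram.

QC Seven Tools for Everyday Life (6): Cause-and-Effect Diagram: "Organizing" the "Causes" | Tech Village / CQ Publishing Co., Ltd., 2009-06-23

As a way of using the diagram for "investigation", you sometimes see it written that you should "extract factors by brainstorming and then draw a cause-and-effect diagram", but that is a wrong way to use a cause-and-effect diagram.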

_ ,

Angel

_ MeCab reported "input-buffer overflow. The line is split. use -b #SIZE option."

> mecab.exe hoge.txt

Running this,

input-buffer overflow. The line is split. use -b #SIZE option.

is what it tells me.

Looking at the help:

-b, --input-buffer-size=INT    set input buffer size (default 8192)

So for now I simply make it ten times larger.

> mecab.exe -b 81920 hoge.txt
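
Rather than guessing a multiplier, the buffer size could be derived from the longest line in the input. The following is a minimal sketch under two assumptions that the help text does not confirm: that the limit is counted in bytes per input line, and that one extra byte of headroom is enough. The method names are made up for this sketch.

```ruby
DEFAULT_BUFFER = 8192  # default from "mecab --help"

# Return the byte length of the longest line in path (line ending excluded).
def longest_line_bytes(path)
  File.foreach(path, mode: "rb").map { |line| line.chomp.bytesize }.max || 0
end

# Suggest a value for -b: the default when every line fits, otherwise the
# next multiple of the default that holds the longest line.
def suggested_buffer_size(path)
  needed = longest_line_bytes(path) + 1  # +1 byte of headroom (assumption)
  return DEFAULT_BUFFER if needed <= DEFAULT_BUFFER

  ((needed + DEFAULT_BUFFER - 1) / DEFAULT_BUFFER) * DEFAULT_BUFFER
end
```

For example, `puts suggested_buffer_size("hoge.txt")` prints a number that can be passed to -b.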

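_ [Tag cloud][Morphological analysis][Mecab] Building a tag cloud from MeCab morphological analysis

Install MeCab
Add a dictionary
Analyze

That is the overall flow.

I first tried to use MeCab's word-segmented (wakati) output, but auxiliary verbs and the like get mixed in, so I decided to run an ordinary morphological analysis and extract only the tokens tagged 固有名詞 (proper noun).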
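
The filtering idea can be sketched like this. The sample text is hand-written in the "surface<TAB>features" layout that matches the default node format in the writer.cpp excerpt; it is illustrative only and was not captured from a real MeCab run. The method name proper_nouns is made up for this sketch.

```ruby
# Hand-written sample in "surface<TAB>features" layout (illustrative only).
SAMPLE_OUTPUT = <<~SAMPLE
  ABEILLE\t名詞,固有名詞,マシン,*,*,*,*,*,*
  で\t助詞,格助詞,一般,*,*,*,で,デ,デ
  走る\t動詞,自立,*,*,五段・ラ行,基本形,走る,ハシル,ハシル
  EOS
SAMPLE

# Return the surface forms whose first two feature columns are
# 名詞 (noun) and 固有名詞 (proper noun).
def proper_nouns(mecab_output)
  mecab_output.each_line.map { |line|
    surface, features = line.chomp.split("\t", 2)
    next if features.nil?  # "EOS" and blank lines carry no features
    pos, subtype = features.split(",")
    surface if pos == "名詞" && subtype == "固有名詞"
  }.compact
end

p proper_nouns(SAMPLE_OUTPUT)
```

Checking the feature columns is slightly stricter than matching 固有名詞 anywhere on the line, since a surface form that itself contains that string would not be picked up by mistake.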

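Environment:

Microsoft Windows 7 64bit
cygwin

Installation

Installation - Windows

Download the binary and install it.

Choose UTF-8 as the dictionary character encoding when installing.

Adding entries to the user dictionary

MeCab: How to add words

Follow the documented steps.

I worked in the Windows Command Prompt.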

Move to a suitable directory (e.g. /home/foo/bar)

> cd C:\home\rin\work\lang\ruby\cloud

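Create a file named foo.csv and add words to it

The contents look like this. I do not really understand the context IDs and cost values, so they are arbitrary. There are no readings either.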

 :
 :
ノーマルチャージ,1288,1288,6000,名詞,固有名詞,ニトロ,*,*,*,*,*,*
ノーマルチャージU,1288,1288,6000,名詞,固有名詞,ニトロ,*,*,*,*,*,*
ノーマルチャージB,1288,1288,6000,名詞,固有名詞,ニトロ,*,*,*,*,*,*
Seaside Route765,1288,1288,6000,名詞,固有名詞,コース,*,*,*,*,*,*
Seaside Route765 R,1288,1288,6000,名詞,固有名詞,コース,*,*,*,*,*,*
Rave City Riverfront,1288,1288,6000,名詞,固有名詞,コース,*,*,*,*,*,*
Rave City Riverfront R,1288,1288,6000,名詞,固有名詞,コース,*,*,*,*,*,*
ABEILLE,1288,1288,6000,名詞,固有名詞,マシン,*,*,*,*,*,*
BAYONET,1288,1288,6000,名詞,固有名詞,マシン,*,*,*,*,*,*
BISONTE,1288,1288,6000,名詞,固有名詞,マシン,*,*,*,*,*,*
CENTELLE,1288,1288,6000,名詞,固有名詞,マシン,*,*,*,*,*,*
EO,1288,1288,6000,名詞,固有名詞,マシン,*,*,*,*,*,*
ESPERANZA,1288,1288,6000,名詞,固有名詞,マシン,*,*,*,*,*,*
FATALITA,1288,1288,6000,名詞,固有名詞,マシン,*,*,*,*,*,*
 :
 :
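
Lines in this shape can be generated instead of typed by hand. A minimal sketch, assuming the same 13-column layout as the entries above and reusing the same arbitrary IDs and cost; the method name dictionary_entry is made up for this sketch.

```ruby
# Build one user-dictionary CSV line: surface, left ID, right ID, cost,
# then the part-of-speech columns, padded with "*" to 13 columns in total.
# The defaults are the arbitrary values used in foo.csv above.
def dictionary_entry(surface, category, left_id: 1288, right_id: 1288, cost: 6000)
  fields = [surface, left_id, right_id, cost, "名詞", "固有名詞", category]
  fields += ["*"] * 6
  fields.join(",")
end

machines = %w[ABEILLE BAYONET BISONTE]
machines.each { |name| puts dictionary_entry(name, "マシン") }
```

The sketch does no quoting, so a surface form that contains a comma would need extra handling.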

Compiling the dictionary

foo.csv must be saved as UTF-8, and the system dictionary must have been set to UTF-8 when MeCab was installed.

> "C:\Program Files (x86)\MeCab\bin\mecab-dict-index.exe" -d"C:\Program Files (x86)\MeCab\dic\ipadic" -u foo.dic -f utf-8 -t utf-8 foo.csv
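
If this step is scripted from Ruby, passing the arguments as an array avoids the quoting needed for the spaces in "Program Files (x86)". A minimal sketch that uses only the flags from the command above; the method name dict_index_command and its keyword names are made up for this sketch.

```ruby
# Build the argument list for mecab-dict-index.
# -d: system dictionary directory, -u: user dictionary to write,
# -f / -t: encoding of the CSV and of the resulting dictionary.
def dict_index_command(exe:, sysdic:, userdic:, csv:, encoding: "utf-8")
  [exe, "-d", sysdic, "-u", userdic, "-f", encoding, "-t", encoding, csv]
end

cmd = dict_index_command(
  exe: 'C:\Program Files (x86)\MeCab\bin\mecab-dict-index.exe',
  sysdic: 'C:\Program Files (x86)\MeCab\dic\ipadic',
  userdic: "foo.dic",
  csv: "foo.csv"
)
# Array form: no shell is involved, so the paths need no extra quoting.
# system(*cmd) or abort("mecab-dict-index failed")
```

The system call is left commented out so that the sketch can be read and run without MeCab installed.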

Confirm that /home/foo/bar/foo.dic has been created

foo.dic is created in C:\home\rin\work\lang\ruby\cloud

Add the following to /usr/local/lib/mecab/dic/ipadic/dicrc or /usr/local/etc/mecabrc

Since this is Windows, add the following line to C:\Program Files (x86)\MeCab\dic\ipadic\dicrc

userdic = C:\home\rin\work\lang\ruby\cloud\foo.dic
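
When the edit is scripted, it is worth making it idempotent so that repeated runs do not pile up duplicate lines. A minimal sketch that works on the file contents as a string; the method name with_userdic is made up for this sketch, and it only recognises a line that is already written exactly as "userdic = <path>".

```ruby
# Return dicrc_text with a "userdic = <path>" line appended,
# unless exactly that line is already present.
def with_userdic(dicrc_text, dic_path)
  line = "userdic = #{dic_path}"
  return dicrc_text if dicrc_text.each_line.any? { |l| l.strip == line }

  # Make sure the existing text ends with a newline before appending.
  body = (dicrc_text.empty? || dicrc_text.end_with?("\n")) ? dicrc_text : dicrc_text + "\n"
  body + line + "\n"
end

# Usage (paths as in this entry):
# path = 'C:\Program Files (x86)\MeCab\dic\ipadic\dicrc'
# File.write(path, with_userdic(File.read(path), 'C:\home\rin\work\lang\ruby\cloud\foo.dic'))
```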

Analysis

Code

#!/usr/bin/ruby -Ku

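require 'shellwords'

# Run MeCab on inputfile and return the surface form of every token
# whose feature string contains 固有名詞 (proper noun).
def analysis(inputfile)
  mecab_cmd = '/cygdrive/c/Program\ Files\ \(x86\)/MeCab/bin/mecab.exe'
  # Escape the file name so that spaces or shell metacharacters are safe.
  text = `#{mecab_cmd} -b 81920 #{Shellwords.escape(inputfile)}`
  words = []
  lines = text.split("\n")
  # Each output line is "surface<TAB>features"; keep the surface only.
  lines.grep(/固有名詞/) {|line|
    words << line.split("\t")[0]
  }

  return words

end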

# Count how many times each word appears in the analysis result
def tag(text)
  word_count = Hash.new(0)  # unseen words start at 0
  text.each { |w|
    word_count[w] += 1
  }

  return word_count

end

def html(contents)
  out_html = ""
  out_html << make_header()
  out_html << make_css()
  out_html << contents
  out_html << make_footer()
  return out_html
end


def make_css()
  css = ""
  css << "\t<style type=\"text/css\">\n"
  0.upto(24) { |level|
    font = 12 + level
    css << "\tli.tagcloud#{level} {font-size: #{font}px;}\n"
  }

  css << "\t.tagcloud {line-height:1}\n"
  css << "\t.tagcloud ul {list-style-type:none;}\n"
  css << "\t.tagcloud li {display:inline;}\n"
  css << "\t.tagcloud li a {text-decoration:none;}\n"
  css << "\t</style>\n"
  return css
end


def make_header()
  out_html = <<EOS
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>Tag cloud</title>
</head>
<body>
EOS
  return out_html
end

def make_footer()
  out_html = <<EOS
</body>
</html>
EOS

  return out_html
end


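# Build the tag cloud HTML from the word counts
def tagcloud(tags)
  return "" if tags.empty?  # no proper nouns found, so there is nothing to draw

  max_level = Math.sqrt(tags.values.max)
  min_level = Math.sqrt(tags.values.min)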

  factor = 1.0
  if ((max_level - min_level) == 0)
    min_level = min_level - 24
    factor = 1
  else
    factor = 24 / (max_level - min_level)
  end

  tagcloud_html = ""
  tagcloud_html << "<ul class=\"tagcloud\">"

  tags.each { |tag, count|
    level = ((Math.sqrt(count.to_i) - min_level) * factor).to_i
    tagcloud_html << "<li class=\"tagcloud#{level}\">#{tag}</li>\n"
  }

  tagcloud_html << "</ul>"

  return tagcloud_html

end


def output(filepath, contents)
  # The block form closes (and flushes) the file when it finishes.
  File.open(filepath, "w") { |f| f.write(contents) }
end

def build(infile, outfile)
  analyzed_text = analysis(infile)
  tags = tag(analyzed_text)
  tagcloud_html = tagcloud(tags)
  out_html = html(tagcloud_html)
  output(outfile, out_html)
end


def main(argv)
  infile = argv[0]
  outfile = argv[1]
  build(infile, outfile)
end

main(ARGV)
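
The scaling used in tagcloud maps the square root of each count linearly onto the levels 0 to 24, which correspond to the li.tagcloud0 through li.tagcloud24 font sizes in the CSS. A small worked sketch of that arithmetic; the method name cloud_levels is made up for this sketch, and for the case where all counts are equal it simply returns the top level, which is what the script aims for.

```ruby
# Map counts to tag cloud levels 0..steps using the square-root scaling
# from the script above.
def cloud_levels(counts, steps = 24)
  roots = counts.map { |c| Math.sqrt(c) }
  min, max = roots.minmax
  # All counts equal: there is no range to scale over, so use the top level.
  return counts.map { steps } if max == min

  factor = steps / (max - min)
  roots.map { |r| ((r - min) * factor).to_i }
end

# Counts 1, 4 and 25 have square roots 1, 2 and 5, so factor = 24 / 4 = 6.
p cloud_levels([1, 4, 25])  # => [0, 6, 24]
```

Taking the square root compresses large counts, so one very frequent word does not push every other word down to the smallest size.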

The input text is the write-ups of each race from ARC2011 - Ridge Racer 7.

Running it

% ruby cloud.rb in.txt out.html

Result

It comes out like this. You can see that 三嶋出雲 and Downtown Rave City R appear frequently :-)
