『♪ 自由に生きていく方法なんて、1通りだってないさ～』

ebata

5年前

The other day I briefly mentioned my Q-learning reinforcement learning program.

"1600億個の「状態」を、8400万個の状態まで減縮する方法"を見つけて、さらに、今はテストの為に65535個の状態まで小さくして、24時間ぶっつづけでプログラムを走らせています。

I found a way to reduce the 160 billion "states" to 84 million, and for testing I have now shrunk that further to 65535 states and am running the program nonstop for 24 hours.

-----

シミュレーションの中で学習中のオブジェクトが至った状態を調べたら、わずか

When I checked which states the learning object had actually reached in the simulation, there were only

―― 482個

482

でした。

つまり、

In other words,

0.735% の状態にしか至っていなかったのです。

it had reached only 0.735% of the states.
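The percentage can be checked with a couple of lines of arithmetic. This is only a sketch using the two figures quoted in the post; the variable names are mine, not from the actual program.

```python
# Figures quoted in the post; the names are illustrative, not from the real program.
TOTAL_STATES = 65535     # size of the reduced state space used for testing
VISITED_STATES = 482     # distinct states the learning object actually reached

coverage_pct = VISITED_STATES / TOTAL_STATES * 100
print(f"coverage: {coverage_pct:.3f}% of the state space")
```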

-----

これ、私には、かなりの衝撃でした。

This was a great shock to me.

現在、プログラムの中の私のオブジェクトは、1回のトライアルを2秒以内で完了できるようにしていますので、これまで4万回のトライアルが終わっています。

Currently, my object in the program completes one trial in under 2 seconds, so 40,000 trials have finished so far.

1回のトライアルで、約1万2000回程の状態変異をしていますので、これまで、5億回以上の状態変化を繰り返しているはずです。

Since the object goes through about 12,000 state transitions per trial, it should have repeated over 500 million state changes so far.

私のオブジェクトは、"65535"という少数とは比較にならない数の、状態を試しているハズなのです。

My object should have tried states a number of times that is beyond comparison with a small number like "65535".
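As a rough check of the scale involved, here is the same estimate as a small calculation. It assumes the rounded figures from the post (40,000 trials, about 12,000 transitions per trial), which give roughly 480 million transitions, the same order as the figure above. The "uniform" average is a hypothetical yardstick, not something the program measures.

```python
# Rounded figures from the post; names are illustrative.
TRIALS = 40_000
TRANSITIONS_PER_TRIAL = 12_000   # "about 12,000", so the total is approximate
TOTAL_STATES = 65535

total_transitions = TRIALS * TRANSITIONS_PER_TRIAL

# Hypothetical yardstick: if transitions were spread evenly over all states,
# each state would be visited this many times on average.
avg_visits_if_uniform = total_transitions / TOTAL_STATES

print(f"total transitions: {total_transitions:,}")
print(f"average visits per state if uniform: {avg_visits_if_uniform:,.0f}")
```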

Q学習をご存知の方は、「学習収束の結果だろう」と思うかもしれませんが、今回のケース、乱数の選択率を90%に設定しているので、「収束」は理由になりません。

If you know Q-learning, you might think, "That must be the result of the learning converging." In this case, however, I set the random action selection rate to 90%, so convergence cannot be the reason.
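For readers unfamiliar with the term, a "random selection rate of 90%" is usually implemented as epsilon-greedy action selection with epsilon = 0.9. The following is a minimal sketch under that assumption; the function name, the Q-values, and the number of actions are all made up for illustration and are not taken from my program.

```python
import random

def select_action(q_values, epsilon, rng):
    """Epsilon-greedy: pick a random action with probability epsilon,
    otherwise pick the action with the highest Q-value."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

rng = random.Random(0)
q_values = [0.1, 0.9, 0.3, 0.2]   # hypothetical Q-values for one state
EPSILON = 0.9                     # the 90% random selection rate

picks = [select_action(q_values, EPSILON, rng) for _ in range(10_000)]
greedy_share = picks.count(1) / len(picks)   # action 1 has the highest Q-value
print(f"share of picks that hit the greedy action: {greedy_share:.3f}")
```

With four actions, the greedy action is expected to be chosen only about 0.1 + 0.9 / 4 = 32.5% of the time, so the learned Q-values have little influence on which states get visited.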

----

ところで、

By the way,

『♪ 自由に生きていく方法なんて、100通りだってあるさ～ (*)』

"There are 100 ways to live freely ... (*)"

(*)「風を感じて」/浜田省吾

(*) "Feel the wind" / Shogo Hamada

という歌が、カップラーメンのCMで流れていた時、私は中学生でした。

I was a junior high school student when the song was being played in a cup noodle commercial.

その時でさえ、私は、

Even then, I was thinking

『バカ言ってんじゃねーよ』

"Don't talk nonsense!"

『10通りでもいいから、スペックアウト(書き出し)てみろよ』

"Go ahead and spec out (write down) even just ten of them!"

という、「中二病」とはちょっと違った方向で、ひねくれていました。

and I was being contrary, in a direction slightly different from the usual "chunibyo" (eighth-grader syndrome).

-----

今回のQ-Learningによる強化学習プログラムで、改めて分ったことは、

What this Q-learning reinforcement learning program made clear to me once again is that

(1)歌手の浜田省吾さんは、人生の状態空間(ここで言う"65535"通り」について唄っています

(1) The singer Shogo Hamada is singing about the state space of life (the "65535" ways, in my case).

しかし、

However,

(2)彼は、状態の遷移における位置的制約または時間的制約の概念を、完全に無視している

(2) he completely ignores the notion of positional or temporal constraints on state transitions.

ということです。

今回、私の作った私のオブジェクトは、制約のない自由行動を5億回も許されていました。

This time, the object I created was allowed as many as 500 million free, unconstrained actions.

しかし、たった500の状態にも至れなかったのです。

Yet it could not reach even a mere 500 states.
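The effect of such constraints can be seen even in a toy model. The sketch below is not my actual program: it assumes a made-up state space laid out on a line, where each action can only move to a neighboring state (a positional constraint) and every trial restarts from state 0 after a fixed number of steps (a temporal constraint). The trial counts are scaled down so it runs in a moment.

```python
import random

N_STATES = 65535          # same size as the test state space in the post
TRIALS = 200              # scaled down from 40,000 for a quick run
STEPS_PER_TRIAL = 2_000   # scaled down from about 12,000

rng = random.Random(0)
visited = set()

for _ in range(TRIALS):
    state = 0                              # temporal constraint: every trial restarts here
    visited.add(state)
    for _ in range(STEPS_PER_TRIAL):
        step = rng.choice((-1, 1))         # the action itself is completely random
        # positional constraint: only a neighboring state is reachable in one step
        state = max(0, min(N_STATES - 1, state + step))
        visited.add(state)

total_steps = TRIALS * STEPS_PER_TRIAL
print(f"steps taken: {total_steps:,}")
print(f"distinct states visited: {len(visited)} of {N_STATES}")
```

Even though the number of steps is several times the number of states, the walk cannot visit more than STEPS_PER_TRIAL + 1 distinct states in this model, and a random walk typically stays far below even that bound.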

----

ここから考えると、浜田省吾さんは、

Seen from this viewpoint, Shogo Hamada should have sung

『♪ 自由に生きていく方法なんて、1通りだってないさ～』

"There isn't even one way to live freely ..."

と唄うべきであり、それこそが、ティーンエイジャに対する正しいメッセージであったと言えます。

and that would have been the right message for teenagers.

実際のところ、このシミュレーションの結果は、現実世界の状況とも、恐しいほど合致していると思います。

As a matter of fact, I think the results of this simulation match the situation in the real world to a frightening degree.