Penggunaan Kamus Singkatan Kata Bahasa Indonesia Sehari-Hari dalam Pembangkitan Fitur Teks (The Use of a Dictionary of Everyday Indonesian Abbreviations in Text Feature Generation)

Authors

  • Citra Lestari, Universitas Ciputra Surabaya
  • Kenny Jihiro, Universitas Ciputra Surabaya
  • Andreas Lim, Universitas Ciputra Surabaya
  • Daniel Aprilio, Universitas Ciputra Surabaya
  • Franciscus Valentinus, Universitas Ciputra Surabaya

DOI:

https://doi.org/10.32493/informatika.v8i2.29306

Keywords:

Indonesian informal abbreviation, dictionary, lemmatization, feature generation

Abstract

Natural Language Processing (NLP) research on the Indonesian language has progressed relatively slowly compared with research on other languages, such as English or Chinese. Most studies deal with formal Indonesian texts. NLP studies that deal with informal Indonesian texts face considerable difficulty, because informal Indonesian usually mixes formal language, everyday language, and local languages. In addition, Indonesians habitually use abbreviations when texting. These factors make feature generation very difficult: machines fail to identify stopwords and to form lemmas from the bag of words. Dictionaries already exist that can be used for lemmatization of formal Indonesian, everyday language, local languages, and even formal Indonesian abbreviations, but there is still no dictionary for informal Indonesian abbreviations. This research built a dictionary of informal Indonesian abbreviations from 4000 Indonesian tweets. The dictionary contains 706 unique abbreviations. The dictionary was then used to generate features. To measure its significance, the feature generation in this research used only this dictionary. Feature generation with the informal abbreviation dictionary was tested on Indonesian tweets about the Covid-19 vaccine. The process identified 2262 abbreviations, 71.09% of which were identified as stopwords. As a further step, the generated features were tested for their impact on sentiment analysis, which used a Multi-Layer Perceptron. Unfortunately, the features did not improve the performance of the sentiment analysis: accuracy decreased by 3.5%, while precision, recall, and F1-score decreased by 0.02 to 0.04. From this result, it can be concluded that this dictionary alone is not sufficient for lemmatization; it needs to be combined with other dictionaries to achieve a better result.
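
The dictionary-based feature generation described in the abstract can be illustrated with a minimal sketch. The abstract does not give implementation details, so everything here is an assumption made for illustration: the five dictionary entries, the small stopword list, and the function names are hypothetical and are not taken from the 706-entry dictionary or from the authors' code.

```python
import re

# Hypothetical sample entries for illustration only; the real dictionary
# described in the abstract contains 706 abbreviations that are not listed here.
ABBREVIATIONS = {
    "yg": "yang",
    "dgn": "dengan",
    "tdk": "tidak",
    "utk": "untuk",
    "sdh": "sudah",
}

# Hypothetical stopword list. "tidak" (not) is deliberately left out,
# on the assumption that negation matters for sentiment analysis.
STOPWORDS = {"yang", "dengan", "untuk", "sudah", "di", "dan"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def expand_abbreviations(tokens, abbreviations):
    """Replace each known abbreviation with its full form.

    Returns the expanded token list and the (abbreviation, full form)
    pairs that were found.
    """
    expanded, hits = [], []
    for token in tokens:
        full = abbreviations.get(token)
        if full is None:
            expanded.append(token)
        else:
            expanded.append(full)
            hits.append((token, full))
    return expanded, hits


def generate_features(text, abbreviations=ABBREVIATIONS, stopwords=STOPWORDS):
    """Expand abbreviations, drop stopwords, and report how many of the
    identified abbreviations turned out to be stopwords."""
    expanded, hits = expand_abbreviations(tokenize(text), abbreviations)
    features = [t for t in expanded if t not in stopwords]
    stopword_hits = [full for _, full in hits if full in stopwords]
    share = len(stopword_hits) / len(hits) if hits else 0.0
    return {
        "features": features,
        "abbreviations_found": len(hits),
        "stopword_share": share,
    }


result = generate_features("Vaksin yg sdh datang tdk utk dijual")
print(result)
```

In this sketch, the share of identified abbreviations that expand to stopwords plays the role of the 71.09% figure reported in the abstract. An abbreviation is only recognized as a stopword after it has been expanded, which is why the expansion step comes before stopword removal.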

Published

2023-06-30